Foreword



The papers contained in this volume were presented at the third international 
Workshop on Implementing Automata, held September 17-19, 1998, at the Uni- 
versity of Rouen, France. 

Automata theory is the cornerstone of computer science theory. While there 
is much practical experience with using automata, this work covers diverse ar- 
eas, including parsing, computational linguistics, speech recognition, text search- 
ing, device controllers, distributed systems, and protocol analysis. Consequently, 
techniques that have been discovered in one area may not be known in another. 
In addition, there is a growing number of symbolic manipulation environments 
designed to assist researchers in experimenting with and teaching on automata 
and their implementation; examples include FLAP, FADELA, AMORE, Fire- 
Lite, Automate, AGL, Turing’s World, FinITE, INR, and Grail. Developers of 
such systems have not had a forum in which to expose and compare their work. 
The purpose of this workshop was to bring together members of the academic, 
research, and industrial communities with an interest in implementing automata, 
to demonstrate their work and to explain the problems they have been solving. 

These workshops started in 1996 and 1997 at the University of Western 
Ontario, London, Ontario, Canada, prompted by Derick Wood and Sheng Yu. 
The major motivation for starting these workshops was that there had been 
no single forum in which automata-implementation issues had been discussed. 
The interest shown in the first and second workshops demonstrated that there 
was a need for such a forum. The participation at the third workshop was very 
interesting: we counted sixty-three registrations, four continents, ten countries, 
twenty-three universities, and three companies. 

The general organization and orientation of WIA conferences is governed by 
a steering committee composed of Jean-Marc Champarnaud, Stuart Margolis, 
Denis Maurel, and Sheng Yu, with Derick Wood as chair. The WIA 1999 meeting 
will be held at the University of Potsdam, Germany, and the 2000 meeting in 
London, Ontario. 
Abstract. We investigate the complexity of a variety of normal-form
transformations for extended context-free grammars, where by extended 
we mean that the set of right-hand sides for each nonterminal in such a 
grammar is a regular set. The study is motivated by the implementation 
project GraMa, which will provide a C++ toolkit for the symbolic manipulation
of context-free objects just as Grail does for regular objects.
The results are that all transformations of interest take time linear in the 
size of the given grammar giving resulting grammars that are larger by a 
constant factor than the original grammar. Our results generalize known 
bounds for context-free grammars but do so in nontrivial ways. Specifi- 
cally, we introduce a new representation scheme for extended context-free 
grammars (the symbol-threaded expression forest), a new normal form 
for these grammars (dot normal form) and new regular expression algo- 
rithms. 



1 Introduction 

In the 1960s, extended context-free grammars were introduced, using Backus-Naur
form, as a useful abbreviatory notation that made context-free grammars easier
to write. More recently, the Standard Generalized Markup Language (SGML) used a
similar abbreviatory notation to define extended context-free grammars for
documents. Currently, XML [5], which is a simplified version of SGML, is being
promoted as the markup language for the web, instead of HTML

* This research was supported under a grant from the Research Grants Council of
Hong Kong SAR. It was carried out while the first and second authors were visiting 
HKUST. 
(a specific grammar or DTD specified using SGML). These developments led to
the investigation of how notions applicable to context-free grammars could be 
carried over to extended context-free grammars. There does not appear to have 
been any consolidated effort to study extended context-free grammars in their 
own right. We begin such an investigation with the most basic problems for 
extended context-free grammars: reduction and normal-form transformations. 
There has been some related work that is more directly motivated by SGML issues;
see the proof of decidability of structural equivalence for extended context-free
grammars [3] and the demonstration that SGML exceptions do not add expressive
power to extended context-free grammars.

We are currently designing a manipulation system toolkit GraMa for extended
context-free grammars, pushdown machines, and context-free expressions. It is an
extension of Grail, a similar toolkit for regular expressions and finite-state
machines. As a result, we need to choose appropriate representations of grammars
and machines that admit efficient transformation algorithms (as well as other
algorithms of interest). More specifically, we study the extensions specified by
regular expressions as used in SGML and XML [5] DTDs.

Earlier results on context-free grammars were obtained by Harrison and
Yehudai [11] and by Hunt et al. [13], among others. Harrison’s chapter [10]
on normal form transformations provides an excellent survey of early results.

Cohen and Gotlieb [4] suggested a specific representation for context-free
grammars and demonstrated how it aided the programming of various operations 
on them. 

We first define extended context-free grammars using the notion of production
schemas that are based on regular expressions. In a separate paper [7], we
discuss the algorithmic effects of basing the schemas on finite-state machines. As 
finite-state machines and regular expressions are both first-class objects in Grail, 
they can be used interchangeably and we expect that they will be in GraMa. 

We then describe algorithms for the fundamental normal-form transforma- 
tions in Section 3. Before doing so, we propose a representation for extended 
context-free grammars as regular expression forests with symbol threads. We 
then discuss some algorithmic problems for regular expressions before tackling 
the various normal forms. When we discuss unit-production removal we will 
modify this representation somewhat to ensure that we avoid blow-up in the 
size of the target grammar. 



2 Notation and Terminology 

We treat extended context-free grammars as context-free grammars in which
the right-hand sides of productions are regular expressions. Let $V$ be an
alphabet. Then, we define a regular expression over $V$ and its language in the
usual way, with the Kleene plus as an additional operator. The symbol $\lambda$
denotes the null string. We denote by $V_E$ the set of symbols of $V$ that appear
in a regular expression $E$.
An extended context-free grammar $G$ is specified by a tuple $(N, \Sigma, P, S)$,
where $N$ and $\Sigma$ are disjoint finite alphabets of nonterminal symbols and
terminal symbols, respectively, $P$ is a finite set of production schemas,
and the nonterminal $S$ is the sentence symbol. Each production schema has
the form $A \to E_A$, where $A$ is a nonterminal and $E_A$ is a regular expression
over $V = N \cup \Sigma$. When $\beta = \beta_1 A \beta_2 \in V^*$, for some nonterminal
$A$, $A \to E_A \in P$, and $\alpha \in L(E_A)$, the string $\beta_1 \alpha \beta_2$ can be
derived from the string $\beta$ and we denote this fact by writing
$\beta \Rightarrow \beta_1 \alpha \beta_2$. The language $L(G)$ of an extended context-free
grammar $G$ is the set of terminal strings derivable from the sentence symbol
of $G$. Formally, $L(G) = \{ w \in \Sigma^* \mid S \Rightarrow^+ w \}$, where $\Rightarrow^+$
denotes the transitive closure of the derivability relation.

Even though a production schema may correspond to an infinite number 
of ordinary context-free productions, it is known that extended and standard 
context-free grammars describe exactly the same languages; for example, see the
texts of Salomaa and of Wood.

As is standard in the literature, we denote the size of a regular expression $E$
by $|E|$ and define it as the number of symbols and metasymbols in $E$. We denote
the size of a set $A$ by $\#A$. To measure the complexity of any grammatical
transformation we need to define the size of a grammar. There are two traditional
measures of the size of a context-free grammar that we extend to extended
context-free grammars as follows. Given an extended context-free grammar
$G = (N, \Sigma, P, S)$, we define the size of $G$ to be

$$\sum_{A \in N} |E_A|$$

and we denote it by $|G|$. We define the norm of $G$ to be

$$|G| \log \#(N \cup \Sigma)$$

and we denote it by $\|G\|$. Clearly, the norm is a more realistic measure of a
grammar’s size as it takes into account the size of the encoding of the symbols
of the grammar, but we use only the size measure here.

3 Normal-Form Transformations 

We need the notion of an expression forest that represents the regular expressions
that appear as right-hand sides of production schemas. Each production schema’s
right-hand side is represented as an expression tree in the usual way: internal
nodes are labeled with operators and external nodes are labeled with symbols.
In addition, we represent the nonterminal left-hand side of a production schema
with a single node labeled with that nonterminal. The node also refers to the
root of the expression tree of its corresponding right-hand side. In Fig. 1, we give
an example of an expression forest for two regular expressions.

As an extended context-free grammar has a number of production schemas 
that are regular expressions, we represent such grammars as an expression forest, 
where each tree in the forest corresponds to one production schema and each 
tree is named by its corresponding nonterminal. (The naming avoids the tree
repetition problem.) We now add threads such that the thread for symbol X 
connects all appearances of the symbol X in the expression trees. 
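
As an illustration only, the following Python sketch shows one way such a symbol-threaded expression forest could be laid out in memory. The names `Node`, `ExpressionForest`, `add_schema` and the helper constructors are hypothetical and are not taken from GraMa or Grail; the thread for a symbol is simply a list of the leaves labeled with it.

```python
from collections import defaultdict

class Node:
    """One node of an expression tree: an operator node or a symbol leaf."""
    def __init__(self, label, children=()):
        self.label = label              # '+', '.', '*', a symbol, or 'λ'
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

class ExpressionForest:
    """Trees named by nonterminals, plus one thread (list of leaves) per symbol."""
    def __init__(self):
        self.trees = {}                     # nonterminal -> root Node
        self.threads = defaultdict(list)    # symbol -> leaves labeled with it

    def add_schema(self, nonterminal, root):
        self.trees[nonterminal] = root
        stack = [root]
        while stack:                        # visit every node of the new tree once
            node = stack.pop()
            if node.children:
                stack.extend(node.children)
            elif node.label != 'λ':         # the null string has no thread
                self.threads[node.label].append(node)

# Hypothetical helper constructors (assumptions made for this sketch).
def sym(x): return Node(x)
def plus(left, right): return Node('+', [left, right])
def dot(left, right): return Node('.', [left, right])
def star(body): return Node('*', [body])

forest = ExpressionForest()
forest.add_schema('S', dot(sym('A'), star(plus(sym('a'), sym('A')))))
forest.add_schema('A', plus(sym('b'), Node('λ')))
occurrences_of_A = forest.threads['A']      # both appearances of A, found via the thread
```

With this layout, visiting all appearances of a symbol costs time proportional to the number of appearances, which is the property the later algorithms rely on.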



3.1 Reachability and Usefulness 

A symbol $X$ is reachable if it appears in some string derived from the sentence
symbol; that is, if there is a derivation $S \Rightarrow^* \alpha X \beta$, where $\alpha$ and
$\beta$ are possibly null strings over $\Sigma \cup N$.

As in standard context-free grammars, reachable symbols can be easily
identified by means of a digraph traversal. More precisely, we construct a digraph
whose vertices are symbols in $N \cup \Sigma$ and there is an edge from $A$ to $B$ if and
only if $B$ labels an external node of the expression tree named $A$. (We assume that
the regular expressions do not contain the empty-set symbol.) Then, a depth-first
traversal of this digraph starting from $S$ gives all reachable symbols of the
grammar. The time taken by the traversal and digraph construction is linear in
the size of the grammar.
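
The traversal can be sketched as follows. This is a minimal illustration under our own assumptions: regular expressions are encoded as nested tuples `('sym', x)`, `('lam',)`, `('+', l, r)`, `('.', l, r)` and `('*', e)`, and a symbol counts as a nonterminal exactly when it has a schema. The function names are hypothetical.

```python
def leaf_symbols(expr):
    """Yield the symbols labeling the external nodes of a tuple-encoded expression."""
    kind = expr[0]
    if kind == 'sym':
        yield expr[1]
    elif kind in ('+', '.'):
        yield from leaf_symbols(expr[1])
        yield from leaf_symbols(expr[2])
    elif kind == '*':
        yield from leaf_symbols(expr[1])
    # ('lam',) contributes no symbol

def reachable(schemas, start):
    """Depth-first traversal of the digraph with an edge A -> B for each leaf B of E_A."""
    seen = {start}
    stack = [start]
    while stack:
        a = stack.pop()
        if a not in schemas:        # terminal symbol: no outgoing edges
            continue
        for b in leaf_symbols(schemas[a]):
            if b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

schemas = {
    'S': ('.', ('sym', 'A'), ('*', ('sym', 'a'))),
    'A': ('+', ('sym', 'b'), ('lam',)),
    'B': ('sym', 'c'),              # never mentioned from S
}
reached = reachable(schemas, 'S')
```

In this example `B` and `c` are not reached from `S`, so they would be dropped when the grammar is reduced.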

A nonterminal symbol $A$ is useful if there is a derivation $A \Rightarrow^* \alpha$, where
$\alpha$ is a terminal string. The set of useful symbols can be computed recursively
as follows. We first find all symbols $B$ such that $L(E_B)$ contains a string of
terminal symbols (possibly the null string). All these $B$ are useful symbols.
Then, a symbol $A$ is useful if $L(E_A)$ contains a string of terminals and the
currently detected useful symbols, and so on until no newly useful symbols are
identified. We formalize this recursive process with a marking algorithm such as
described by Wood for context-free grammars. The major difference between
previous work and the approach taken here is that we want to obtain an efficient
algorithm. Yehudai designed an efficient algorithm for determining usefulness
for context-free grammars; our approach can be viewed as a generalization of his
work.

To explain the marking algorithm, we assume that we have one bit available 
at each node of the expression forest to indicate the marking. We initialize these 
bits in a preorder traversal of the forest as follows: The bits of all nodes are set to 
zero (unmarked) except for nodes that are labeled with Kleene star, a terminal 
symbol or the null string — the bits of these nodes are set to one (marked). In 
the algorithm, whenever a node u is marked (that is, it is useful), it satisfies
the condition: the language of the subtree rooted at u contains a string that is
completely marked. A *-node is marked since its subtree’s language contains the
null string; therefore, a *-node is always useful.

After completing the initial marking, we propagate markings up the trees in 
a propagation phase as follows: Repeatedly examine newly marked nodes until 
no newly marked nodes are obtained. For each newly marked node u, where p(u) 
is u’s parent if u has one, perform one of the following actions: 

If p(u) is a +-node and p(u) is not marked, then mark p(u).

If p(u) is a ·-node, p(u) is not marked and u’s sibling is marked, then mark p(u).

If p(u) is a *-node, then do nothing.

If u is a root node and the expression tree’s nonterminal symbol is not marked,
then mark the expression tree’s nonterminal symbol.

If there are newly marked nonterminals after this initial round, then we 
mark all their appearances in the expression forest; otherwise, we terminate the 
algorithm. We now repeat the propagation phase which propagates the markings 
of the newly marked symbols up the trees. 

The algorithm has, therefore, a number of rounds and at the beginning of each 
round it marks all appearances of newly discovered useful symbols (discovered 
in the previous round) and then carries out the propagation phase. As long 
as a round discovers newly marked nonterminals, the process is repeated. To 
implement this process efficiently, we construct, at the beginning of each round, a 
queue of newly marked nodes. Note that the queue is a catenation of appearance 
lists after the first round. The algorithm then repeatedly deletes an appearance 
from the queue and, using the preceding propagation rules, it may also append 
new appearances to the queue. A round terminates when the queue is empty. 
We can exploit the representation of the grammar by an expression forest to 
ensure that this algorithm runs in time linear in |G|. The idea is to make the 
test on $L(E_A)$ a pruning operation in the corresponding expression tree. More
precisely, each time we test an $L(E_A)$, we delete part of its expression tree;
once we “delete” the root, we will have discovered that A is useful. Since the 
algorithm actually deletes portions of the trees, we apply it to a copy of the 
expression forest. We first preprocess the forest to prune all subtrees that have 
Kleene star as their root label. More precisely, we perform a preorder traversal 
of each tree and whenever we find a *-node we replace it and its subtree with 
a $\lambda$-node. Then, we perform a second pruning in a bottom-up fashion starting
from external nodes labeled with the null string, terminal symbols or currently
detected useful symbols. This operation is defined recursively as follows. Let $x$
be a node of an expression tree and let $p(x)$ be its parent. If $p(x)$ is a +-node,
then delete the subtree rooted at $p(x)$ and recursively apply pruning at $p(p(x))$;
if $p(x)$ is a ·-node, then replace it and its subtree with the sibling of $x$ and its
subtree, and recursively apply pruning at this node. The overall time for pruning
is linear in $|G|$ since we can use the threads to find all appearances of the symbol
we want to prune in the expression forest. Note that each node of the expression
forest is visited at most twice. Thus, the marking algorithm runs in $O(|G|)$ time
and space.
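
The marking variant of the algorithm can be sketched in Python as follows. This is only an illustration under stated assumptions: the rounds described above are folded into a single work queue, the class `Node` and the function `useful_nonterminals` are hypothetical names, and any leaf whose label has no tree of its own is treated as a terminal (or the null string). Each node enters the queue at most once, which is what keeps the sketch linear in the size of the forest.

```python
from collections import defaultdict, deque

class Node:
    """Expression-tree node; operators are '+', '.', '*', leaves carry a symbol."""
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)
        self.parent, self.marked = None, False
        for child in self.children:
            child.parent = self

def useful_nonterminals(trees):
    """trees maps each nonterminal to the root of its expression tree."""
    threads = defaultdict(list)   # nonterminal -> leaves labeled with it
    owner = {}                    # id(root) -> nonterminal naming that tree
    queue = deque()

    def mark(node):
        node.marked = True
        queue.append(node)

    # Initial marking: *-nodes, terminal leaves and null-string leaves.
    for name, root in trees.items():
        owner[id(root)] = name
        stack = [root]
        while stack:
            node = stack.pop()
            stack.extend(node.children)
            is_leaf = not node.children
            if is_leaf and node.label in trees:
                threads[node.label].append(node)   # nonterminal leaf: wait
            elif is_leaf or node.label == '*':
                mark(node)

    # Propagation phase, driven by the queue of newly marked nodes.
    useful = set()
    while queue:
        u = queue.popleft()
        p = u.parent
        if p is None:                               # u is a root
            name = owner[id(u)]
            if name not in useful:
                useful.add(name)
                for leaf in threads[name]:          # mark all appearances
                    if not leaf.marked:
                        mark(leaf)
        elif not p.marked:
            # '+' needs one marked child, '.' needs all of them marked.
            if p.label == '+' or all(c.marked for c in p.children):
                mark(p)
    return useful

trees = {
    'S': Node('+', [Node('.', [Node('A'), Node('B')]), Node('a')]),
    'A': Node('.', [Node('A'), Node('b')]),   # A can never terminate
    'B': Node('*', [Node('c')]),
    'C': Node('C'),                           # C only derives itself
}
result = useful_nonterminals(trees)
```

Here `S` and `B` are found useful, while `A` and `C` are not, because every string in their schemas contains an unmarked nonterminal.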

A grammar G is reduced if all its symbols are both reachable and useful. 
As in standard context-free grammars, to reduce a grammar we first identify all 
useful symbols and then select (together with the corresponding schemas) those 
that are also reachable. 

3.2 Null-Free Form

Since production schemas are not productions but only specify productions, we 
can convert an extended context-free grammar into a null-free extended context- 
free grammar much more easily than the corresponding conversion for context- 
free grammars. Given an extended context-free grammar $G = (N, \Sigma, P, S)$, we
can determine the nullable nonterminals (the ones that derive the null string)
using an algorithm similar to the one we used for usefulness; see Section 3.1. This
algorithm takes $O(|G|)$ time. Given this information we can make the grammar
null free in two steps. First, we replace all appearances of each nullable symbol
$A$ with the regular expression $(A + \lambda)$. This step takes time proportional to
the total number of appearances of nullable symbols in $G$; we use the symbol
threads for fast access to them.

In the second step we transform each production schema $A \to E_A$, where $A$
is nullable, into a null-free production schema $A \to E'_A$, where
$\lambda \notin L(E'_A)$. This step takes exponential time in the worst case. The
exponential behavior is attained in a pathological worst case when we have
nested dotted subexpressions in which each operand of the dot can produce the
null string. We can replace each such pair $F \cdot G$ with
$F_- + G_- + (F_- \cdot G_-)$, where $F_-$ is the transformed version of $F$
that does not produce the null string. Note that we at least double the length of
the subexpression. As the doubling can occur in the subexpressions of $F$ and $G$
and of their subexpressions, we obtain the exponential worst-case bound. (Note
that this is the same case that occurs when the grammar is context-free and
every symbol is nullable.)

We can improve the exponential worst-case bound by modifying the
transformation of $F \cdot G$ to be $F_- + (F \cdot G_-)$. We can then prove that the size
of the resulting expression is, in the worst case, quadratic in the size of the
original expression. Indeed, we can apply Ziadi’s approach and replace $F \cdot G$ with
$F_- + (F \cdot G_-)$, if $|F| < |G|$, or with $G_- + (F_- \cdot G)$, otherwise. In this case,
we get a blowup of $O(n \log n)$, where $n$ is the size of the original expression.
Observe that if we represent the regular expressions with finite-state machines,
we would obtain at most quadratic blow up. To reduce the size of the resulting
expression to linear, we have to use a different technique. We can, however, apply
a similar sleight-of-hand technique invented by Yehudai. For standard context-free
grammars, he reduces the lengths of the right-hand sides to at most two, so
that the problem with catenation has only a constant blow up.
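
To make the rewriting rule concrete, here is a small Python sketch of the variant that replaces a dot with two nullable operands by $F_- + (F \cdot G_-)$. It is an illustration under our own assumptions, not the GraMa implementation: expressions are nested tuples, Kleene star is handled by rewriting $E^*$ without the null string as $E_- \cdot E_-^*$, and the names `nullable`, `strip_null`, `size` and `words` are hypothetical. The helper `words` enumerates short strings of single-character symbols and is there only to check the result.

```python
def nullable(e):
    """True if the language of the tuple-encoded expression contains the null string."""
    kind = e[0]
    if kind in ('lam', '*'):
        return True
    if kind == 'sym':
        return False
    if kind == '+':
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])          # kind == '.'

def strip_null(e):
    """Return e_- with L(e_-) = L(e) minus the null string, or None if that is empty."""
    kind = e[0]
    if kind == 'lam':
        return None
    if kind == 'sym':
        return e
    if kind == '+':
        parts = [p for p in (strip_null(e[1]), strip_null(e[2])) if p is not None]
        if not parts:
            return None
        return parts[0] if len(parts) == 1 else ('+', parts[0], parts[1])
    if kind == '*':
        body = strip_null(e[1])
        return None if body is None else ('.', body, ('*', body))
    f, g = e[1], e[2]
    if not (nullable(f) and nullable(g)):
        return e                                      # the dot cannot yield the null string
    f_, g_ = strip_null(f), strip_null(g)
    if f_ is None:
        return g_
    if g_ is None:
        return f_
    return ('+', f_, ('.', f, g_))                    # F_- + (F . G_-)

def size(e):
    """Number of symbols and metasymbols in the expression."""
    return 1 + sum(size(c) for c in e[1:] if isinstance(c, tuple))

def words(e, n):
    """All strings of length <= n in L(e); used only to check the sketch."""
    kind = e[0]
    if kind == 'lam':
        return {''}
    if kind == 'sym':
        return {e[1]} if n >= 1 else set()
    if kind == '+':
        return words(e[1], n) | words(e[2], n)
    if kind == '.':
        return {u + v for u in words(e[1], n) for v in words(e[2], n)
                if len(u + v) <= n}
    result, frontier, base = {''}, {''}, words(e[1], n)
    while frontier:
        frontier = {u + v for u in frontier for v in base
                    if len(u + v) <= n} - result
        result |= frontier
    return result

lam = ('lam',)
opt = lambda x: ('+', ('sym', x), lam)                # (x + λ)
expr = ('.', ('.', opt('a'), opt('b')), opt('c'))
null_free = strip_null(expr)
```

Comparing `size(expr)` with `size(null_free)` on deeper nestings gives a feel for the growth that the dot normal form introduced next is meant to avoid.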

We can replace each dotted subtree of an expression tree, that has a dotted 
ancestor, with a new nonterminal and add a new production schema to the 
grammar. We can repeat this local modification until no dotted subtree has a 
dotted subtree within it. Thus, the new grammar has size that is of the same 
order as the original size. For example, given the production schema 

$$A \to (((a + b + \lambda) \cdot (b + \lambda)) \cdot (c + d + \lambda)) \cdot (e + \lambda);$$

we can replace it with new production schemas:

$$A \to (B \cdot (c + d + \lambda)) \cdot (e + \lambda)$$

and

$$B \to (a + b + \lambda) \cdot (b + \lambda).$$

Repeating the transformation for $A$, we obtain

$$A \to C \cdot (e + \lambda)$$

and

$$C \to B \cdot (c + d + \lambda).$$

We say that the resulting grammar is in dot normal form. We can readily
argue that the size of the resulting grammar is at most twice the size of the
original grammar and that it has increased the number of nonterminals by at most
the size of the original grammar.

Basing the preceding null-removal construction on grammars in dot normal
form ensures that it runs in $O(|G|)$ time and produces a new grammar
that has size $O(|G|)$.
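
The conversion to dot normal form can be sketched as below, using the example schema for $A$ from this section. This is a minimal illustration under our assumptions: expressions are nested tuples, the function name `to_dot_normal_form` is hypothetical, and the fresh nonterminals are named `X1`, `X2`, and so on (rather than $B$ and $C$), on the assumption that these names do not clash with existing symbols.

```python
import itertools

def to_dot_normal_form(schemas, fresh_prefix='X'):
    """Return schemas in which no dot node has another dot node below it."""
    counter = itertools.count(1)
    result = {}
    pending = list(schemas.items())

    def rewrite(e, under_dot):
        kind = e[0]
        if kind in ('sym', 'lam'):
            return e
        if kind == '.':
            if under_dot:
                # A dotted subtree with a dotted ancestor becomes a new nonterminal.
                name = f'{fresh_prefix}{next(counter)}'
                pending.append((name, e))
                return ('sym', name)
            return ('.', rewrite(e[1], True), rewrite(e[2], True))
        # '+' and '*' nodes are kept; only their children are rewritten.
        return (kind,) + tuple(rewrite(c, under_dot) for c in e[1:])

    while pending:
        name, expr = pending.pop()
        result[name] = rewrite(expr, False)
    return result

lam = ('lam',)
def s(x): return ('sym', x)
abl = ('+', ('+', s('a'), s('b')), lam)      # a + b + λ
bl = ('+', s('b'), lam)                      # b + λ
cdl = ('+', ('+', s('c'), s('d')), lam)      # c + d + λ
el = ('+', s('e'), lam)                      # e + λ
schemas = {'A': ('.', ('.', ('.', abl, bl), cdl), el)}
dnf = to_dot_normal_form(schemas)
```

On this input the sketch produces three schemas that correspond to $A \to C \cdot (e + \lambda)$, $C \to B \cdot (c + d + \lambda)$ and $B \to (a + b + \lambda) \cdot (b + \lambda)$, with `X1` playing the role of $C$ and `X2` the role of $B$.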



3.3 Unit-Free Form

A unit production is a production of the form $A \to B$. Transforming an
extended context-free grammar into unit-free form requires three steps. First, we
identify all unit productions. Second, we transform the regular expressions (their
representations as expression trees) such that their languages do not include
unit nonterminal strings. Lastly, we modify the production schemas such that
if $A \Rightarrow^+ B$ ($B$ is unit-reachable from $A$), then $A \to E'_A + E_B$, where
$B$ is not in $L(E'_A)$.

Given an extended context-free grammar G, for each nonterminal symbol 
$A$, we can determine its unit productions in time $O(|E_A|)$ by traversing $E_A$
in preorder. Whenever we meet a Kleene *-node or a +-node we continue the
traversal in the corresponding subtrees. If we meet a ·-node, then it depends on
whether the languages corresponding to the two subtrees contain the null string 
or not: if one of them does, then we continue the traversal in the other subtree 
(if both of them contain the null string, then we continue the traversal in both 
subtrees). When eventually we reach a node labeled with a nonterminal B, then 
that occurrence of B corresponds to a unit production for A. We define a new 
thread, which we refer to as the “unit thread” that connects all occurrences 
of nonterminals that correspond to unit productions in the expression forest. 
The unit threads are used in the second step. Moreover, while discovering unit 
productions, we define, for each nonterminal A, the set of nonterminals that are 
unit-reachable from A. These sets are needed to modify the production schemas. 
The overall running time for this step is $O(|G|)$.
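
The first step can be sketched as follows. This is an illustration only: expressions are nested tuples, the names `nullable`, `unit_targets` and `unit_reachable` are hypothetical, and nullability is recomputed at every dot node for brevity, so the sketch does not by itself achieve the $O(|E_A|)$ bound; storing a nullability bit at each node would be needed for that.

```python
def nullable(e):
    """True if the language of the tuple-encoded expression contains the null string."""
    kind = e[0]
    if kind in ('lam', '*'):
        return True
    if kind == 'sym':
        return False
    if kind == '+':
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])          # kind == '.'

def unit_targets(e, nonterminals):
    """Nonterminals B such that the one-symbol string B belongs to L(e)."""
    kind = e[0]
    if kind == 'sym':
        return {e[1]} if e[1] in nonterminals else set()
    if kind == 'lam':
        return set()
    if kind == '+':
        return unit_targets(e[1], nonterminals) | unit_targets(e[2], nonterminals)
    if kind == '*':
        return unit_targets(e[1], nonterminals)
    found = set()                                     # kind == '.'
    if nullable(e[2]):                                # right side can vanish
        found |= unit_targets(e[1], nonterminals)
    if nullable(e[1]):                                # left side can vanish
        found |= unit_targets(e[2], nonterminals)
    return found

def unit_reachable(schemas):
    """For each nonterminal A, the set of nonterminals unit-reachable from A."""
    direct = {a: unit_targets(e, schemas.keys()) for a, e in schemas.items()}
    closure = {}
    for a in schemas:
        seen, stack = set(), list(direct[a])
        while stack:
            b = stack.pop()
            if b not in seen:
                seen.add(b)
                stack.extend(direct[b])
        closure[a] = seen
    return closure

lam = ('lam',)
def s(x): return ('sym', x)
schemas = {
    # S -> A + (B . (c + λ)) + (a . C)
    'S': ('+', ('+', s('A'), ('.', s('B'), ('+', s('c'), lam))), ('.', s('a'), s('C'))),
    'A': ('*', s('B')),
    'B': s('b'),
    'C': s('c'),
}
units = unit_reachable(schemas)
```

In the example, `C` is not a unit target of `S` because its sibling `a` cannot produce the null string.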

In the second step, we modify the expression trees at the nodes where we 
found unit productions. We traverse the expression trees from their frontiers to 
their roots and, in particular, we follow the paths that start from the nodes 
labeled with nonterminals that correspond to unit productions (we access them 
by following the unit threads). Assume, for simplicity, that the expression trees
do not contain Kleene * (if the language of a subexpression contains the null
string, then we transform the expression using Kleene + and the symbol $\lambda$).
Let $B_i$ be a nonterminal occurrence corresponding to one of these unit
productions in the expression tree named $A$. We go up the paths as long as we meet
only +-nodes.
Consider what happens when we reach a ·-node and its subtrees both yield
the null string. Let $L$ and $R$ be its subtrees and let $B = \{B_1, \ldots, B_h\}$ and
$C = \{C_1, \ldots, C_k\}$ be the occurrences of nonterminals corresponding to unit
productions that are in $L$ and $R$, respectively, that we have not yet “eliminated”.
We replace the ·-subtree with a new tree that combines with +- and ·-nodes the
following copies of the original subtrees: $L$, $R$, $L$ without $B$, $R$ without $C$, $L$
without $\lambda$ and $R$ without $\lambda$. Using the notation $[L]$, $[R]$, $[L - B]$, $[R - C]$,
$[L - \lambda]$ and $[R - \lambda]$ to denote these subtrees, we construct the expression:

$$[L - B] + [R - C] + ([L] \cdot [R - \lambda]) + ([L - \lambda] \cdot [R]).$$

In this way we eliminate all unit nonterminal strings $B \cup C$ from the language
of the regular expression. When only one of the subtrees yields the null string, 
the transformation is simpler but similar. 

When we meet a *-node, we replace its subtree in a similar way. Let $T$ be
a subtree with a *-root node and let $B = \{B_1, \ldots, B_h\}$ be the occurrences of
nonterminals corresponding to unit productions that we have not yet eliminated.
We replace that *-node with a subtree that combines with +- and Kleene +-nodes
the following subtrees: $\lambda$, $T$ and $T$ without $B$ to give the regular expression

$$\lambda + [T - B] + [T]^{++}.$$

The time complexity of this second step is the same as that of the transformation
of regular expressions into null-free expressions (see Section 3.2).

Finally, we need to modify the production schemas such that, for each
nonterminal $A$, if $B_1, \ldots, B_h$ are all symbols unit reachable from $A$, then we
define the new production schema as

$$A \to E_A + E_{B_1} + \cdots + E_{B_h}.$$

Rather than duplicating the corresponding trees in the expression forest (it would
give a quadratic blow up in the size of the resulting grammar in terms of $|G|$) we
define the new expression tree with +-nodes and pointers to the trees named $A$
and $B_1, \ldots, B_h$, respectively. This takes time $O(\#N^2)$ and only $O(\#N^2)$ extra
space is added to the size of the grammar $G$.



3.4 Greibach Form 

This normal form result for context-free grammars was established by Sheila
Greibach [8] in the 1960s; it was a key result in the use of the multiple-path
syntactic analyzer developed at Harvard University at that time. An extended
context-free grammar is in Greibach normal form if its productions have only
the following form:

$$A \to a\alpha,$$

where $a$ is a terminal symbol and $\alpha$ is a possibly empty string of nonterminal
symbols. The transformation of an extended context-free grammar into Greibach
normal form requires two giant steps: left-recursion removal and back left
substitution. Recall that a grammar is left recursive if there is a nonterminal $A$
such that $A \Rightarrow^+ A\alpha$ in the grammar, for some string $\alpha$. We consider
the second step first.

We assume that the given extended context-free grammar $G = (N, \Sigma, P, S)$
is reduced, factored, null free, and unit free. In addition, for the second step we
also assume that the grammar is not left recursive. Since the grammar is not
left recursive there is a partial order on the nonterminals, left reachability,
that is defined by $A \le B$ if there is a leftmost derivation $A \Rightarrow^* B\alpha$.
As usual, we can consider the nonterminals to be enumerated as $A_1, \ldots, A_n$
such that whenever $A_i \le A_j$, then $i \le j$. Observe that $A_n$ is already in
Greibach normal form since it has only productions of the form $A_n \to a$, where
$a \in \Sigma$. We now convert the nonterminals one at a time from $A_{n-1}$ down to
$A_1$. The conversion is conceptually simple, yet computationally expensive. When
converting $A_i$, we replace all nonterminals that can appear in the first
positions in the strings in $L(E_{A_i})$ with their regular expressions. Thus, $A_i$
is now in Greibach normal form. To be able to carry out this substitution
efficiently we first convert each regular expression $E_{A_i}$ into first normal
form; that is, we break it into the sum of regular expressions that each begin
with a unique symbol. More precisely, letting
$\Sigma = \{a_{n+1}, \ldots, a_{n+m}\}$ and using the notation $E_i$ instead of
$E_{A_i}$ for simplicity, we replace $E_i$ by $\bar{E}_i$, which is defined as follows:



$$\bar{E}_i = A_{i+1} \cdot E_{i,i+1} + \cdots + A_n \cdot E_{i,n}
  + E_{i,n+1} \cdot a_{n+1} + \cdots + E_{i,n+m} \cdot a_{n+m},$$

where $L(E_{i,n+k}) = \{\lambda\}$, if $a_{n+k} \in L(E_i)$, and
$L(E_{i,n+k}) = \emptyset$, if $a_{n+k} \notin L(E_i)$. The conversion should satisfy
$L(\bar{E}_i) = L(E_i)$. Then, to convert $\bar{E}_i$ into an equivalent regular
expression $E'_i$ in Greibach normal form, we need only replace the first $A_j$ of
each term with $E'_j$.

If the grammar is left recursive, we first need to remove or change left
recursion. We use a technique introduced by Greibach [9], investigated in detail
by Hotz and his co-workers [12] and rediscovered by others [6]. It involves
producing, for each nonterminal, a distinct subgrammar of G that is left linear 
and, hence, can be converted into an equivalent right linear grammar. This con- 
version changes left recursion into right recursion and does not introduce any 
new left recursion. For more details, see Wood’s text. The important property
of the left-linear subgrammars is that every sentential leftmost derivation
sequence in G can be mimicked by a sequence of leftmost derivation sequences,
each of which is a sentential leftmost derivation sequence in one of the left-linear
subgrammars. Once we convert the left-linear grammars into right-linear grammars
this property is weakened in that we mimic the original derivation sequence
with a sequence of sentential rightmost derivation sequences in the right-linear 
grammars. The new grammar that is equivalent to G is the collection of the 
distinct right-linear grammars, one for each nonterminal in G. 

As the modified grammar is no longer left recursive, we can now apply left 
back substitution to obtain a final grammar in Greibach normal form. 
How well does this algorithm perform? The back left substitution causes, in
the worst case, a blow up of $\#N|G|$ in the size of the Greibach normal form
grammar. Left recursion removal causes a blow up of $\#N|G|$ at worst. Lastly,
converting the production schemas into first normal form causes a blow up of
$2\#N|G|$ (this bound is not obvious). As all three operations take time
proportional to the sizes of their output grammars, essentially, Greibach normal
form takes $O(\#N^5|G|)$ time, in the worst case. The reason for the $\#N^5$ term is
that we first remove left recursion, which not only increases the size of the
grammar but also squares the number of nonterminals from $\#N$ to $\#N^2$. The
number of nonterminals is crucial in the size bound for the grammar obtained by
first normal form conversion and left back substitution.

We can however, reduce the worst-case time and space by using indirection 
as we did for unit-production removal. Rather than performing the back left 
substitution for a specific nonterminal, we use a reference to its schema. This 
technique gives a blowup of only |G| -I- at most and, thus, it reduces the 
complete conversion time to be 0{^N^\G\) in the worst case. 

We can also apply the technique that Koch and Blum suggested; namely,
leave unit-production removal until after we have obtained Greibach-like normal 
form. Moreover, transforming an extended context-free grammar into dot normal 
form appears to be a very useful technique to avoid undesirable blow up in 
grammar size. We are currently investigating this and other approaches. 



4 Final Remarks 

The results that we have presented are truly a generalization of similar results for 
context-free grammars. The time and space bounds are similar when relativized 
to the grammar sizes. The novelty of the algorithms is four-fold. First, we have 
introduced the regular expression forest with symbol threads as an efficient data 
representation for context-free grammars and extended context-free grammars. 
We believe that this representation is new. The only previously documented 
representations are those of Cohen and Gotlieb [4] and of Barnes [2] and they
are much more simplistic. Second, we have demonstrated how indirection using
referencing can save time and space in the unit-production removal and left
back substitution algorithms. Although the use of the technique is novel in this
context, it is a well-known technique in other applications. It is an application of
lazy evaluation or evaluation on demand. Third, we have introduced dot normal
form for extended context-free grammars that plays a role similar to Chomsky
normal form for standard context-free grammars. Fourth, we have generalized
the left-linear grammar to right-linear conversion to the extended case.

We are currently investigating whether we can get Greibach normal form 
more efficiently and whether we can improve the performance of unit-production
removal. 

Lastly, we would like to mention some other applications of the regular
expression forest with symbol threads. First, we can reduce usefulness
determination to nullability determination. Given an extended context-free grammar
$G = (N, \Sigma, P, S)$, we can replace every appearance of every terminal symbol
with the null string to give $G' = (N, \emptyset, P', S')$. Now, a nonterminal $A$ in $G$
is useful if and only if it is nullable in $G'$. Second, we can use the same
algorithm to determine the length of the shortest terminal strings generated by
each nonterminal symbol. The idea is that we replace each appearance of a
terminal symbol with the integer 1 and each appearance of the null string with 0.
We then repeatedly replace: each node labeled "+" that has two integer children
with the minimum of the two integers; each node labeled "·" that has two integer
children with the sum of the two integers; and each node labeled "*" with 0. The
root value is the required length.
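
The shortest-length computation just described can be sketched as follows. This is a minimal illustration and not the authors' implementation: the tuple encoding of expression nodes, the names `min_len`, `shortest_lengths` and `RULES`, and the example grammar are assumptions made here, and the symbol threads of the expression forest are replaced by a plain fixed-point loop over the nonterminals.

```python
import math

def min_len(expr, shortest):
    """Length of the shortest terminal string denoted by an expression tree."""
    kind = expr[0]
    if kind == "eps":            # null string counts 0
        return 0
    if kind == "term":           # terminal symbol counts 1
        return 1
    if kind == "nt":             # nonterminal: use the current estimate
        return shortest[expr[1]]
    if kind == "alt":            # "+" node: minimum of the children
        return min(min_len(expr[1], shortest), min_len(expr[2], shortest))
    if kind == "cat":            # "." node: sum of the children
        return min_len(expr[1], shortest) + min_len(expr[2], shortest)
    if kind == "star":           # "*" node: zero repetitions are allowed
        return 0
    raise ValueError(f"unknown node kind: {kind}")

def shortest_lengths(rules):
    """Iterate until no estimate improves; math.inf marks 'no terminal string'."""
    shortest = {a: math.inf for a in rules}
    changed = True
    while changed:
        changed = False
        for a, expr in rules.items():
            value = min_len(expr, shortest)
            if value < shortest[a]:
                shortest[a] = value
                changed = True
    return shortest

# Hypothetical extended grammar: S -> a (B + c c), B -> b B*, U -> a U
RULES = {
    "S": ("cat", ("term", "a"),
          ("alt", ("nt", "B"), ("cat", ("term", "c"), ("term", "c")))),
    "B": ("cat", ("term", "b"), ("star", ("nt", "B"))),
    "U": ("cat", ("term", "a"), ("nt", "U")),
}
```

In this sketch a nonterminal that keeps the value `math.inf` generates no terminal string at all, which corresponds to the usefulness test mentioned above.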

We can use the same “generic” algorithm to compute the smallest terminal 
alphabet for the terminal strings derived from a nonterminal and so on. 



References 

1. A.V. Aho and J.D. Ullman. The Theory of Parsing, Translation, and Compiling, 
Vol. I: Parsing. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1972. 

2. K.R. Barnes. Exploratory steps towards a grammatical manipulation package 
(GRAMPA). Master’s thesis, McMaster University, Hamilton, Ontario, Canada, 
1972. 

3. H.A. Cameron and D. Wood. Structural equivalence of extended context-free and 
extended EOL grammars. In preparation, 1998. 

4. D.J. Cohen and C.C. Gotlieb. A list structure form of grammars for syntactic 
analysis. Computing Surveys, 2:65-82, 1970. 

5. D. Connolly. W3C web page on XML. http://www.w3.org/XML/, 1997. 

6. A. Ehrenfeucht and G. Rozenberg. An easy proof of Greibach normal form. In- 
formation and Control, 63:190-199, 1984. 

7. D. Giammarresi and D. Wood. Transition diagram systems and normal form 
transformations. In Proceedings of the Sixth Italian Conference on Theoretical 
Computer Science, Singapore, 1998. World Scientific Publishing Co. Pte. Ltd. 

8. S.A. Greibach. A new normal form theorem for context-free phrase structure 
grammars. Journal of the ACM, 12:42-52, 1965. 

9. S.A. Greibach. A simple proof of the standard-form theorem for context-free gram- 
mars. Technical report. Harvard University, Cambridge, MA, 1967. 

10. M.A. Harrison. Introduction to Formal Language Theory. Addison-Wesley,
Reading, MA, 1978.

11. M.A. Harrison and A. Yehudai. Eliminating null rules in linear time. Computer 
Journal, 24:156-161, 1981. 

12. G. Hotz. Normal-form transformations of context-free grammars. Acta Cybernet- 
ica, 4:65-84, 1978. 

13. H.B. Hunt III, D.J. Rosenkrantz, and T.G. Szymanski. On the equivalence,
containment and covering problems for the regular and context-free languages.
Journal of Computer and System Sciences, 12:222-268, 1976.

14. ISO 8879: Information processing — Text and office systems — Standard Gener- 
alized Markup Language (SGML), October 1986. International Organization 
for Standardization. 







15. P. Kilpeläinen and D. Wood. SGML and exceptions. In C. Nicholas and D. Wood,
editors, Proceedings of the Third International Workshop on Principles of
Document Processing (PODP 96), pages 39-49, Heidelberg, 1997. Springer-Verlag.
Lecture Notes in Computer Science 1293.

16. R. Koch and N. Blum. Greibach normal form transformation revisited. In Reischuk
and Morvan, editors, STACS’97 Proceedings, pages 47-54, New York, NY, 1997.
Springer-Verlag. Lecture Notes in Computer Science 1200.

17. D.R. Raymond and D. Wood. Grail: Engineering automata in C++, version 2.5.
Technical Report HKUST-CS96-24, Department of Computer Science, Hong Kong 
University of Science & Technology, Clear Water Bay, Kowloon, Hong Kong, 1996. 

18. D.R. Raymond and D. Wood. Grail: A C++ library for automata and expressions.
Journal of Symbolic Computation, 17:341-350, 1994. 

19. R.J. Ross. Grammar Transformations Based on Regular Decompositions of 
Context-Free Derivations. PhD thesis, Department of Computer Science, Wash- 
ington State University, Pullman, WA, USA, 1978. 

20. R.J. Ross, G. Hotz, and D.B. Benson. A general Greibach normal form transforma- 
tion. Technical Report CS-78-048, Department of Computer Science, Washington 
State University, Pullman, WA, USA, 1978. 

21. A. Salomaa. Formal Languages. Academic Press, New York, NY, 1973. 

22. D. Wood. Theory of Computation. John Wiley & Sons, Inc., New York, NY, 1987. 

23. D. Wood. Theory of Computation. John Wiley & Sons, Inc., New York, NY, second 
edition, 1998. In preparation. 

24. A. Yehudai. On the Complexity of Grammar and Language Problems. PhD thesis. 
University of California, Berkeley, CA, 1977. 

25. D. Ziadi. Regular expression for a language without empty word. Theoretical
Computer Science, 163:309-315, 1996.




On Parsing LL-Languages 



Norbert Blum 



Informatik IV
Universität Bonn

Römerstr. 164, D-53117 Bonn, Germany
blum@cs.uni-bonn.de



Abstract. Usually, a parser for an LL(k)-grammar $G$ is a deterministic
pushdown transducer which produces a leftmost derivation for a
given input string $x \in L(G)$. Ukkonen has given a family of LL(2)-grammars
proving that every parser for these grammars has exponential
size. If we add to a parser the possibility to manipulate a constant number
of pointers which point to positions within the constructed part of
the leftmost derivation and to change the output in such positions, we
obtain an extended parser for the LL(k)-grammar $G$. Given an arbitrary
LL(k)-grammar $G$, we will show how to construct an extended parser of
polynomial size manipulating at most $k^2$ pointers.



1 Definitions, Notations, and Basic Results 

We assume that the reader is familiar with the elementary theory of LL(k)-parsing
as presented in standard textbooks (see, e.g., [?]). First, we review
the notation used in the sequel.

A context-free grammar (cfg) $G$ is a four-tuple $(V, \Sigma, P, S)$ where $V$ is a
finite, nonempty set of symbols called the total vocabulary, $\Sigma \subset V$ is a finite set
of terminal symbols, $N = V \setminus \Sigma$ is the set of nonterminal symbols (or variables),
$P$ is a finite set of rules (or productions), and $S \in N$ is the start symbol. The
productions are of the form $A \to \alpha$, where $A \in N$ and $\alpha \in V^*$; $\alpha$ is called
an alternative of $A$. $L(G)$ denotes the context-free language generated by $G$. The
size $|G|$ of the cfg $G$ is defined by

$$|G| = \sum_{A \to \alpha \in P} \lg(A\alpha),$$

where $\lg(A\alpha)$ is the length of the string $A\alpha$. Let $\varepsilon$ denote the empty word.
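
As a small worked instance of the size measure, the following sketch computes $|G|$ for a grammar given as a list of productions. The representation (pairs of a left-hand side and a tuple of right-hand-side symbols) and the names `grammar_size` and `PRODUCTIONS` are assumptions made for this illustration only.

```python
def grammar_size(productions):
    """|G|: each production A -> alpha contributes lg(A alpha) = 1 + |alpha|."""
    return sum(1 + len(alpha) for _lhs, alpha in productions)

# Hypothetical example grammar: S -> a S b | eps
PRODUCTIONS = [("S", ("a", "S", "b")), ("S", ())]
```

Here the production S -> a S b contributes 4 and S -> eps contributes 1, so the size of this example grammar is 5.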

A derivation is leftmost if at each step a production is applied to the leftmost
variable. A sentential form within a leftmost derivation is called a left sentential
form. A context-free grammar $G$ is ambiguous if there exists $x \in L(G)$ such
that there are two distinct leftmost derivations of $x$ from the start symbol $S$. A
context-free grammar $G = (V, \Sigma, P, S)$ is reduced if $P = \emptyset$ or, for every $A \in V$,
$S \Rightarrow^* \alpha A \beta \Rightarrow^* w$ for some $\alpha, \beta \in V^*$, $w \in \Sigma^*$.



J.-M. Champarnaud, D. Maurel, D. Ziadi (Eds.): WIA’98, LNCS 1660, pp. 13 ff., 1999.

A pushdown automaton $M$ is a seven-tuple $M = (Q, \Sigma, \Gamma, \delta, q_0, Z_0, F)$, where
$Q$ is a finite, nonempty set of states, $\Sigma$ is a finite, nonempty set of input symbols,
$\Gamma$ is a finite, nonempty set of pushdown symbols, $q_0 \in Q$ is the initial state,
$Z_0 \in \Gamma$ is the start symbol of the pushdown store, $F \subseteq Q$ is the set of final
states, and $\delta$ is a mapping from $Q \times (\Sigma \cup \{\varepsilon\}) \times \Gamma$ to finite subsets of $Q \times \Gamma^*$.

Given any context-free grammar $G = (V, \Sigma, P, S)$, we will construct a pushdown
automaton $M_G$ with $L(M_G) = L(G)$. For the construction of $M_G$ the
following notation is useful.

A production in $P$ with a dot on its right side is an item. More exactly,
let $p = X \to X_1 X_2 \cdots X_{n_p} \in P$. Then $(p, i)$, $0 \le i \le n_p$, is an item which is
represented by $[X \to X_1 X_2 \cdots X_i \cdot X_{i+1} \cdots X_{n_p}]$. Let
$H_G = \{(p, i) \mid p \in P,\ 0 \le i \le n_p\}$ be the set of all items of $G$. Then
$M_G = (Q, \Sigma, \Gamma, \delta, q_0, Z_0, F)$ is defined by

$$Q = H_G \cup \{[S' \to \cdot S], [S' \to S \cdot]\},$$
$$q_0 = [S' \to \cdot S], \quad F = \{[S' \to S \cdot]\},$$
$$\Gamma = Q \cup \{\bot\}, \quad Z_0 = \bot, \quad \text{and}$$
$$\delta : Q \times (\Sigma \cup \{\varepsilon\}) \times \Gamma \to 2^{Q \times \Gamma^*}.$$

$\delta$ will be defined such that $M_G$ simulates a leftmost derivation. With respect to
$\delta$, we distinguish three types of steps.

(E) expansion

$$\delta([X \to \beta \cdot A\gamma], \varepsilon, Z) = \{([A \to \cdot\alpha], [X \to \beta \cdot A\gamma] Z) \mid A \to \alpha \in P\}.$$

The leftmost variable in the left sentential form is replaced by one of its
alternatives. The pushdown store is expanded.

(C) reading

$$\delta([X \to \varphi \cdot a\psi], a, Z) = \{([X \to \varphi a \cdot \psi], Z)\}.$$

The next input symbol is read.

(R) reduction

$$\delta([X \to \alpha \cdot], \varepsilon, [W \to \mu \cdot X\nu]) = \{([W \to \mu X \cdot \nu], \varepsilon)\}.$$

The whole $\alpha$ is derived from $X$. Hence, the dot can be moved beyond $X$ and
the corresponding item can be removed from the pushdown store, getting
the new state. Therefore, the pushdown store is reduced.
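
The three step types can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not part of the paper: items are encoded as triples (left-hand side, right-hand side, dot position), the nondeterminism of the expansion step is resolved by backtracking, and the search is assumed to terminate because the example grammar is not left-recursive. The names `accepts` and `GRAMMAR` are hypothetical.

```python
def accepts(grammar, start, word):
    """Backtracking simulation of the item pushdown automaton M_G.

    grammar maps each variable to a list of alternatives (tuples of symbols).
    Assumption: the grammar is not left-recursive, so the search terminates.
    """
    def run(state, stack, pos):
        lhs, rhs, dot = state
        if dot == len(rhs):                          # (R) reduction
            if not stack:                            # only the bottom marker is left
                return lhs == "S'" and pos == len(word)
            w_lhs, w_rhs, w_dot = stack[-1]          # item [W -> mu . X nu]
            return run((w_lhs, w_rhs, w_dot + 1), stack[:-1], pos)
        symbol = rhs[dot]
        if symbol in grammar:                        # (E) expansion
            return any(run((symbol, alt, 0), stack + (state,), pos)
                       for alt in grammar[symbol])
        # (C) reading
        return (pos < len(word) and word[pos] == symbol
                and run((lhs, rhs, dot + 1), stack, pos + 1))

    # Initial state [S' -> . S] with an empty pushdown store.
    return run(("S'", (start,), 0), (), 0)

# Hypothetical example grammar: S -> a S b | eps
GRAMMAR = {"S": [("a", "S", "b"), ()]}
```

For example, `accepts(GRAMMAR, "S", "aabb")` follows two expansions with the first alternative, one with the empty alternative, and then alternates reading and reduction steps until the final item is reached.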

A pushdown automaton is deterministic if for each $q \in Q$ and $Z \in \Gamma$ either

i) $\delta(q, a, Z)$ contains at most one element for each $a \in \Sigma$ and $\delta(q, \varepsilon, Z) = \emptyset$, or

ii) $\delta(q, a, Z) = \emptyset$ for all $a \in \Sigma$ and $\delta(q, \varepsilon, Z)$ contains at most one element.

A deterministic pushdown transducer is a deterministic pushdown automaton
with the additional property to produce an output. More formally, a deterministic
pushdown transducer is an eight-tuple $(Q, \Sigma, \Gamma, \Delta, \delta, q_0, Z_0, F)$, where all
symbols have the same meaning as for a pushdown automaton except that $\Delta$ is a
finite output alphabet and $\delta$ is now a mapping
$\delta : Q \times (\Sigma \cup \{\varepsilon\}) \times \Gamma \to Q \times \Gamma^* \times \Delta^*$.

For a context-free grammar $G = (V, \Sigma, P, S)$, an integer $k$, and $\alpha \in V^*$,
$\mathit{FIRST}_k(\alpha)$ contains all terminal strings of length $< k$ and all prefixes of length
$k$ of terminal strings which can be derived from $\alpha$ in $G$. More formally,

$$\mathit{FIRST}_k(\alpha) = \{x \in \Sigma^* \mid \alpha \Rightarrow^* xy,\ y \in \Sigma^* \text{ and } |x| = k, \text{ or } \alpha \Rightarrow^* x \text{ and } |x| < k\}.$$

A usual way to represent a finite set of strings is the use of tries. Let $\Sigma$ be a finite
alphabet and $|\Sigma| = s$. A trie with respect to $\Sigma$ is a directed tree $T = (V, E)$
where each node $v \in V$ has outdegree $\le s$. The outgoing edges of a node $v$ are
marked by pairwise distinct elements of the alphabet $\Sigma$. The node $v$ represents
the string $s(v)$ which is obtained by the concatenation of the edge markings on
the unique path from the root $r$ of $T$ to $v$. An efficient algorithm without the
use of fixed-point iteration for the computation of all $\mathit{FIRST}_k$-sets can be found
in [?].
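
To make the two notions concrete, the following sketch computes $\mathit{FIRST}_k$ for every variable and stores the result in a trie. Note that it uses a naive fixed-point iteration, which is explicitly not the efficient algorithm referred to above; it is only meant as an illustration. The assumptions are that terminals are single characters, that the grammar is a dictionary from variables to lists of alternatives, and that the key "$" (a hypothetical marker chosen here) flags trie nodes whose string belongs to the set.

```python
def k_concat(xs, ys, k):
    """Concatenate two sets of strings and truncate every result to length k."""
    return {(x + y)[:k] for x in xs for y in ys}

def first_k(grammar, k):
    """FIRST_k of every variable by naive fixed-point iteration."""
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for alt in alternatives:
                acc = {""}
                for sym in alt:
                    sym_first = first[sym] if sym in grammar else {sym}
                    acc = k_concat(acc, sym_first, k)
                if not acc <= first[a]:
                    first[a] |= acc
                    changed = True
    return first

def build_trie(strings):
    """Nested-dict trie; the key "$" marks nodes whose string is in the set."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

# Hypothetical example grammar: S -> a S b | eps
GRAMMAR = {"S": [("a", "S", "b"), ()]}
```

For this example grammar and k = 2 the set for S consists of the empty string, "ab" and "aa", and the corresponding trie has a root with one child for "a", which in turn has children for "a" and "b".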

Let $G = (V, \Sigma, P, S)$ be a reduced context-free grammar and $k$ be a positive
integer. We say that $G$ is LL(k) if $G$ fulfills the following property: If there are
two leftmost derivations

1. $S \Rightarrow_{lm}^* wA\alpha \Rightarrow_{lm} w\beta\alpha \Rightarrow_{lm}^* wx$ and
2. $S \Rightarrow_{lm}^* wA\alpha \Rightarrow_{lm} w\gamma\alpha \Rightarrow_{lm}^* wy$

such that $\mathit{FIRST}_k(x) = \mathit{FIRST}_k(y)$, then $\beta = \gamma$.

The following implication of the LL(k) definition is central for the construction
of LL(k)-parsers. The proof can be found in [?].

Theorem 1. A cfg $G = (V, \Sigma, P, S)$ is LL(k) if and only if the following
condition holds: If $A \to \beta$ and $A \to \gamma$ are distinct productions in $P$, then
$\mathit{FIRST}_k(\beta\alpha) \cap \mathit{FIRST}_k(\gamma\alpha) = \emptyset$ for all $wA\alpha$ such that $S \Rightarrow^* wA\alpha$.

A parser for an LL(k)-grammar $G$ is a deterministic pushdown transducer
which produces a leftmost derivation for a given input $x \in L(G)$. If we add
to a parser the possibility to manipulate a constant number of pointers which
point to positions within the constructed part of the leftmost derivation and
to change the output in such positions, we obtain an extended parser for the
LL(k)-grammar $G$.

Ukkonen [?] has given a family of LL(2)-grammars and shown that every
parser for these grammars must have exponential size. Given an arbitrary
LL(k)-grammar $G$, we will show how to construct an extended parser of polynomial
size manipulating at most $k^2$ pointers.

2 The Construction of Polynomial Size Extended
LL(k)-Parser

2.1 A Motivating Example 

Given any LL(k)-grammar $G$, our goal is the construction of an extended
LL(k)-parser of polynomial size for $G$. In order to explain the main idea, we consider the
LL(2)-grammar given by Ukkonen for proving an exponential lower bound
on the size of any parser. For $n \in \mathbb{N}$ consider the cfg $G_n = (V_n, \Sigma_n, P_n, A_0)$
defined by

$$P_n:\quad \begin{array}{ll}
A_i \to a_{i+1} A_{i+1} B_{i+1} \mid d_{i+1} A_{i+1} C_{i+1} & (0 \le i \le n-1) \\
A_n \to b_i \mid \varepsilon & (1 \le i \le n) \\
B_i \to b_i c_i \mid \varepsilon & (1 \le i \le n) \\
C_i \to c_i \mid \varepsilon & (1 \le i \le n)
\end{array}$$

It is easy to see that $G_n$ is an LL(2)-grammar. Let us first review Ukkonen’s
idea for proving the exponential lower bound for the size of any LL(k)-parser,
$k \ge 2$, for $G_n$. Consider the leftmost derivation of a string $x_1 x_2 \cdots x_n b_j c_j \ldots$,
where $x_i \in \{a_i, d_i\}$, $1 \le i \le n$. The critical point is when the parser has to
determine the correct alternative for $A_n$ with respect to the left sentential form
$x_1 x_2 \cdots x_n A_n X_n X_{n-1} \cdots X_1$. It needs the knowledge of whether $x_j = a_j$ or
$x_j = d_j$. In the case that $x_j = a_j$, $\varepsilon$ would be the correct alternative, since $b_j c_j$
would be derived from $B_j$. In the other case, $c_j$ would be derived from $C_j$ and
hence, $b_j$ would be the correct alternative for $A_n$. In the case that the parser
does not have the possibility to continue the parsing process and to determine
the correct alternative for $A_n$ later, there are two possibilities to obtain this
information:

1. The information is contained in the current state $q$ of the parser and the
topmost stack symbol $Z$. But then, the number of distinct pairs $(q, Z)$ must
be exponential in $n$ since there exist $2^n$ distinct strings $x_1 x_2 \cdots x_n$ with
$x_i \in \{a_i, d_i\}$, $1 \le i \le n$, and $j$ is not known in advance. This would imply
that the size of the parser must be exponential in the size of the grammar.

2. The parser looks into the stack whether $B_j$ or $C_j$ is pushed. But then, the
information above $B_j$ and $C_j$, respectively, must be stored in the current state $q$
and the current topmost stack symbol $Z$. But then, the number of pairs $(q, Z)$
must be exponential in the size of the grammar, since for the left sentential
form $x_1 x_2 \cdots x_n A_n X_n X_{n-1} \cdots X_{j+1} X_j \ldots$ with $X_l \in \{B_l, C_l\}$, $n \ge l \ge j$,
the number of possible distinct strings $X_n X_{n-1} \cdots X_{j+1}$ is $2^{n-j}$. This would
also imply that the size of the parser must be exponential in the size of the
grammar.

If we allow the parser to continue the parsing process and to delay the
determination of the correct alternative for $A_n$ to the future, then we obtain an extended
LL(2)-parser for $G_n$ of size $O(n)$. We only need a pointer to the correct position
of the production with left-hand side $A_n$ in the computed leftmost derivation.
The extended parser works in the following way.

— The correct alternatives for the variables $A_0, A_1, \ldots, A_{n-1}$ can be derived
immediately from the first unread symbol of the input. The corresponding
productions can be written directly into the leftmost derivation under
construction.

— Let us consider the moment when the parser wants to expand $A_n$. If the
first unread symbol of the input is in $\{c_1, c_2, \ldots, c_n\}$ then it is clear that $\varepsilon$
is the correct alternative of $A_n$. If the first unread symbol of the input is
$b_j$, $1 \le j \le n$, and the second unread symbol is $\ne c_j$, then $b_j$ is the correct
alternative of $A_n$. In both cases, the correct production can be written into
the leftmost derivation under construction.

If the first and second unread symbols are $b_j$ and $c_j$, $1 \le j \le n$, then the
parser cannot determine the correct alternative of $A_n$. The parser creates
a pointer, pointing to the correct (i.e. current) position for the production
with left-hand side $A_n$ in the leftmost derivation under construction, and
continues with the construction of the leftmost derivation.

For each variable of $X_n X_{n-1} \cdots X_{j+1}$ it is clear that $\varepsilon$ is the correct
alternative and the parser writes successively the corresponding production into
the leftmost derivation under construction.

Let us consider the moment when $X_j$ should be expanded. If $X_j = B_j$ then
$b_j c_j$ is the correct alternative for $X_j$ and $\varepsilon$ is the correct alternative for $A_n$.
Otherwise, i.e., $X_j = C_j$, $c_j$ is the correct alternative of $X_j$ and $b_j$ is the
correct alternative of $A_n$. In both cases, the parser can write the correct
production with respect to $A_n$ at the correct position of the leftmost derivation
under construction. Furthermore, the production corresponding to $X_j$ can
be written into the leftmost derivation.

— For all further variables, the correct alternatives can be derived from the
first unread symbol of the input. Hence, the corresponding productions can
be written directly into the leftmost derivation.
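
The steps above can be sketched as a small program. This is only an illustration of the delayed decision for $A_n$ under the grammar $G_n$ as described in the text; the token encoding as (letter, index) pairs, the textual form of the productions, and the names `tok` and `parse_gn` are assumptions made here. The single pointer is modelled as an index into the output list.

```python
def tok(text):
    """Turn 'a1 d2 b2' into [('a', 1), ('d', 2), ('b', 2)] (hypothetical encoding)."""
    return [(t[0], int(t[1:])) for t in text.split()]

def parse_gn(n, tokens):
    """Leftmost derivation for G_n, using one back-patch pointer for A_n."""
    out, stack, pos = [], [], 0          # stack holds the pending X_i, X_n on top

    def peek(offset=0):
        return tokens[pos + offset] if pos + offset < len(tokens) else None

    # A_0 .. A_{n-1}: the first unread symbol decides the alternative.
    for i in range(1, n + 1):
        if peek() not in (("a", i), ("d", i)):
            raise ValueError(f"expected a{i} or d{i} at position {pos}")
        var = "B" if peek()[0] == "a" else "C"
        out.append(f"A{i - 1} -> {peek()[0]}{i} A{i} {var}{i}")
        stack.append((var, i))
        pos += 1

    # A_n: decide now if possible, otherwise leave a slot and remember its position.
    pointer = pending = None
    first = peek()
    if first is not None and first[0] == "b" and 1 <= first[1] <= n:
        j = first[1]
        if peek(1) == ("c", j):
            pointer, pending = len(out), j
            out.append(None)             # patched when X_j is expanded
        else:
            out.append(f"A{n} -> b{j}")
            pos += 1
    else:
        out.append(f"A{n} -> eps")

    # X_n .. X_1
    while stack:
        var, i = stack.pop()
        if pointer is not None and i == pending:
            out[pointer] = f"A{n} -> eps" if var == "B" else f"A{n} -> b{i}"
            out.append(f"B{i} -> b{i} c{i}" if var == "B" else f"C{i} -> c{i}")
            pointer = None
            pos += 2                     # b_j and c_j are consumed together
        elif var == "B" and peek() == ("b", i) and peek(1) == ("c", i):
            out.append(f"B{i} -> b{i} c{i}")
            pos += 2
        elif var == "C" and peek() == ("c", i):
            out.append(f"C{i} -> c{i}")
            pos += 1
        else:
            out.append(f"{var}{i} -> eps")

    if pos != len(tokens) or pointer is not None:
        raise ValueError("input is not in L(G_n)")
    return out
```

For n = 3 the inputs `a1 d2 a3 b2 c2` and `a1 a2 a3 b2 c2` share the lookahead `b2 c2` when $A_3$ is reached; the slot for $A_3$ is filled with `A3 -> b2` in the first case and with `A3 -> eps` in the second, once $X_2$ is expanded.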



2.2 The General Construction 

Our goal is now to generalize the method explained with the help of the example
above such that it can be used for any LL(k)-grammar.

Let $G = (V, \Sigma, P, S)$ be an arbitrary LL(k)-grammar. Consider the pushdown
automaton $M_G$ for $G$. Our goal is to construct the extended parser $P_G$ by a
step-by-step extension of $M_G$. For doing this, assume that $w$ is the read input, $x$ is
the substring of length $k$ which follows $w$, and that $[X \to \beta \cdot A\gamma]$, $A \in N$, is the
state of the pushdown automaton. Then, $P_G$ wants to expand the variable $A$.
Furthermore, assume that $P_G$ knows no symbol of $x = x_1 x_2 \cdots x_k$.

Our goal is now to determine the prefixes of $x$ which can be derived exactly
from $A$. These can be only prefixes $x'$ of $x$ such that $x' \in \mathit{FIRST}_k(A)$. Hence it
is useful that

— $P_G$ computes the maximal prefix $u$ of $x$ which is also a prefix of an element of
$\mathit{FIRST}_k(A)$.

For an efficient solution of this subgoal, we add for each variable $X \in N$ the trie
$T_k(X)$, corresponding to $\mathit{FIRST}_k(X)$, to $P_G$. Now,

— $P_G$ starts to read the lookahead $x$ and, simultaneously, follows the
corresponding path in $T_k(A)$, starting at the root, until the maximal prefix $u$ of
$x$ in $T_k(A)$ is determined.

Depending on whether $|u| < |x|$ or $|u| = |x|$, we distinguish two cases.

Case 1: $|u| < |x|$

It follows directly that the terminal string which is derived from $A$ must be
a prefix $u'$ of $u$. Furthermore, $u' \in \mathit{FIRST}_k(A)$. Hence, it is useful if $P_G$ has
direct access to such prefixes. For getting this, every node $v \in T_k(A)$ contains a
pointer to each node $w \in T_k(A)$ such that

1. $w$ is on the path from the root of $T_k(A)$ to $v$, and

2. $s(w) \in \mathit{FIRST}_k(A)$.
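
The trie walk and the pointers to admissible prefixes can be sketched as follows. The class `TrieNode`, the functions `build_first_trie` and `maximal_prefix`, and the example set are hypothetical names and data chosen for this illustration; the sketch assumes that the path from the root to $v$ includes $v$ itself.

```python
class TrieNode:
    def __init__(self, string, in_set):
        self.string = string              # s(v)
        self.in_set = in_set              # whether s(v) belongs to the represented set
        self.children = {}
        self.accepting_ancestors = []     # nodes w on the root-to-v path with s(w) in the set

def build_first_trie(strings):
    """Build the trie and give every node its list of prefix pointers."""
    strings = set(strings)
    root = TrieNode("", "" in strings)
    root.accepting_ancestors = [root] if root.in_set else []
    for s in sorted(strings):
        node = root
        for ch in s:
            if ch not in node.children:
                child = TrieNode(node.string + ch, node.string + ch in strings)
                child.accepting_ancestors = (
                    node.accepting_ancestors + ([child] if child.in_set else []))
                node.children[ch] = child
            node = node.children[ch]
    return root

def maximal_prefix(root, lookahead):
    """Follow the lookahead from the root; the node reached represents u."""
    node = root
    for ch in lookahead:
        if ch not in node.children:
            break
        node = node.children[ch]
    return node

# Hypothetical set standing in for FIRST_3(A)
EXAMPLE_ROOT = build_first_trie({"a", "ab", "abc"})
```

With the lookahead "abd" the walk stops at the node for u = "ab", and its pointer list names the candidates "a" and "ab", which are exactly the prefixes for which queues would be initialized in Case 1.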

Since $P_G$ does not know the whole lookahead $x$, possibly $P_G$ cannot decide which
of these $\le k$ prefixes is the correct string which is derived from $A$ within the
leftmost derivation of the input. Hence, $P_G$ has to store all these possibilities.

For doing this, $P_G$ initializes for each of these prefixes $u'$ a queue which
contains one element. This element contains the variable $A$, the terminal string
$u'$, and a pointer to the position within the leftmost derivation at which the
derivation of $u'$ from $A$ has to be written if $u'$ is the correct terminal
string.

Now, the parser $P_G$ continues the construction of the leftmost derivation with
the first unexpanded variable or unread terminal symbol of the state or of an
item in the pushdown store. For simplicity, we assume that the current state is
always stored at the top of the pushdown store. Since for all items, the treatment
of the symbols before the dot is finished and the symbol directly behind the dot
is a variable which has just been expanded, the needed symbol is at the second
position behind the dot of an item. Hence, as long as the topmost item in the
pushdown store contains fewer than two symbols behind the dot, $P_G$ performs
pop operations.

Assume that $B$ is the second symbol behind the dot of the topmost item in
the pushdown store. Then $B$ is an unexpanded variable or $B$ is a terminal symbol
which is not treated by a reading step. In both cases, $P_G$ considers iteratively
all queues. Assume that $M$ is the queue under consideration and that $j$ is the
length of the prefix $u'$ of $x$ which corresponds to the queue $M$. We distinguish
two cases.

Case 1.1: $B \in N$

It is useful if $P_G$ knows the maximal prefix of the next $k - j$ input symbols
which is also a prefix of an element in $\mathit{FIRST}_k(B)$. In contrast to the above,
$P_G$ has already read $|u| - j$ of these $k - j$ symbols. Hence, $P_G$ needs the possibility
of direct access to the “correct” node in $T_k(B)$ with respect to the read prefix
of the next $k - j$ symbols. Then, $P_G$ can proceed analogously to the above.

For getting this direct access, we extend $P_G$ by a trie $T_G$ representing the set
$\Sigma^k$. Also, $P_G$ manipulates a pointer $P(T_G)$ which always points to the node $u$
in $T_G$ such that $s(u)$ is the prefix of the lookahead $x$ already read.

For $v \in T_G$ let $d(v)$ denote the depth of $v$ in $T_G$ and $s_i(v)$, $0 \le i \le d(v)$,
denote the suffix of $s(v)$ which starts with the $(i+1)$st symbol of $s(v)$. Every
node $v \in T_G$ contains for all $A \in N$ and $1 \le i \le d(v)$ a pointer $P_{i,A}(v)$ which
points to the node $w \in T_k(A)$ such that $s(w)$ is the maximal prefix of an element
of $\mathit{FIRST}_k(A)$ which is also a prefix of $s_i(v)$.

Using the pointer $P_{j,B}(u)$, where $u$ is the node to which $P(T_G)$ points, $P_G$
has direct access to the correct node $w$ in $T_k(B)$.

If $s(w) \ne s_j(u)$ then $s(w)$ is the maximal prefix of $s_j(u)$ which is a prefix of
an element of $\mathit{FIRST}_k(B)$. In this case, $P_G$ proceeds analogously to the above.
But instead of the initialization of new queues, $P_G$ extends the queue $M$ under
consideration in an appropriate manner. This can be done as follows.

For every prefix $u'$ of $s(w)$ with $u' \in \mathit{FIRST}_k(B)$, the node corresponding
to the head of the queue $M$ obtains a new successor corresponding to the prefix
$u'$. If no such $u'$ exists, then $P_G$ is in a dead end with respect to the queue $M$
and $M$ can be deleted completely.

If $s(w) = s_j(u)$ then $P_G$ continues to read the rest of the lookahead $x$ and,
simultaneously, follows the corresponding path in $T_k(B)$, starting at the node $w$.

Case 1.2: $B \in \Sigma$

If $B = x_{j+1}$ then $P_G$ extends the queue $M$ under consideration by adding a
new successor corresponding to $x_{j+1}$ to the node corresponding to the head of
$M$. Otherwise, $P_G$ is in a dead end with respect to the queue $M$ and $M$ can be
deleted completely.

Case 2: $|u| = |x|$

Then $x \in \mathit{FIRST}_k(A)$. Theorem 1 implies that for every proper prefix $u'$ of $x$
such that $u' \in \mathit{FIRST}_k(A)$, with respect to the leftmost derivation
for the input string of $P_G$, $u'$ cannot be the terminal string which is derived from
$A$. Moreover, at most one alternative $\alpha$ of $A$ with $x \in \mathit{FIRST}_k(\alpha)$ exists. Hence,
$P_G$ can determine the correct expansion step.

Altogether, the data structure for the set of queues is a forest. The path from
the root of a tree to a leaf always corresponds to a current queue and vice versa.
The leaf is the head of the queue and the root is the tail of the queue. Every node
on the path from the tail to the head corresponds to a substring of $x$. The prefix
of $x$ corresponding to the queue can be obtained by the concatenation of these
substrings. As observed below, the corresponding prefixes of two distinct queues
are distinct. Moreover, the roots of two distinct trees of the forest correspond to
distinct prefixes of $x$.

Next, we want to derive an upper bound for the number of queues. Note that
an LL(k)-grammar is always unambiguous.

Lemma 1. The number of queues is always $\le k$.

Proof. Consider the moment when, for the current symbol, all work with respect
to all the queues is done. Each such moment corresponds to a string
$\gamma = A\gamma' \in NV^*$, where $P_G$ does not know the correct alternative for $A$. The queues
correspond to pairwise distinct derivations of a proper prefix of $x$ from $\gamma$. Note
that $\varepsilon$ is a prefix of length 0 of $x$.

Assume that two distinct queues correspond to a derivation of the same
prefix of $x$. Then there exists a word in $L(G)$ which has two distinct leftmost
derivations from the start symbol $S$. Note that $G$ is reduced. But this would be
a contradiction to the unambiguity of $G$. Hence, at most one queue corresponds
to every prefix of $x$. Hence, the number of queues is bounded by $k$. □

Now we want to bound the length of every queue by $k$. If we take care that
every node except the head of a queue corresponds to a string of length $> 0$,
then every queue contains at most $k$ nodes. For getting this property, every
time when $\varepsilon$ can be the string derived from the variable $C$ under consideration,
$P_G$ writes the unique derivation of $\varepsilon$ from $C$ into the leftmost derivation under
construction. In the case that another string $u$ would be derived from $C$, this
wrong derivation will be overwritten later by the correct derivation of $u$ from
$C$. Hence, always when $\varepsilon$ is the correct terminal string derived from a variable
$C$, the correct derivation of $\varepsilon$ from $C$ is written into the leftmost derivation under
construction and nothing has to be changed with respect to this part of the
derivation. We need the possibility that the head of a queue can correspond to
the terminal string $\varepsilon$ for the case that the father of this head has another son
and also for the case that there is a queue which corresponds to the string $\varepsilon$. By
construction, nodes corresponding to $\varepsilon$ need no pointer to a position within the
leftmost derivation under construction. In the case that the head of the queue
$M$ under consideration corresponds to $\varepsilon$, instead of the head of $M$, the father of
the head of $M$ obtains new successors.

Altogether, the number of nodes within the forest is bounded by $k^2$ and
hence, the number of pointers is also bounded by $k^2$.

The earliest moment when the prefix of $x$ derived from $A$ can be determined is
the moment when the forest consists of exactly one tree. This prefix corresponds
to the root of this unique tree. By Theorem 1, this must happen when the
last symbol of the lookahead is read or earlier. When the forest consists of
exactly one tree, the prefix $u'$ of $x$ corresponding to the root $r$ of this tree is
the terminal string which is derived from $A$. $P_G$ modifies the leftmost derivation
under construction as follows.

— Using the pointer of the root $r$, $P_G$ has access to the position within the
leftmost derivation under construction where the leftmost derivation of $u'$
from $A$ has to be written. We can assume that this derivation is precomputed
such that $P_G$ can write this derivation at the correct position.

— The root $r$ is deleted. Possibly, the new forest now contains more than
one tree.

Since $u'$ is derived from $A$, the new lookahead is shifted $|u'|$ positions. Hence, $P_G$
has to update the pointer $P(T_G)$. For doing this within constant time, each node
$v \in T_G$ contains $d(v) - 1$ pointers $P_i(v)$, $1 \le i < d(v)$, pointing to the unique
node $w \in T_G$ with $s(w) = s_i(v)$. Now, $P_G$ performs the following step.

— Using the pointer $P_{|u'|}(v)$, where $v$ is the node to which $P(T_G)$ points, $P_G$
has direct access to the node to which the pointer $P(T_G)$ has to point. After
updating the pointer $P(T_G)$, the parser continues its work at the point
where the work was interrupted.

Next, we want to bound the size of $P_G$ and the parsing time. By construction, $P_G$
contains $|N| + 1$ tries. Each trie consists of at most $2|\Sigma|^k$ nodes. For all nodes in
$T_G$, the number of pointers is bounded by $(|N| + 1)k$. For all $A \in N$, for all nodes
in $T_k(A)$, the number of pointers is at most $k$. Hence, all tries need $O(k|N||\Sigma|^k)$
space. The space for the precomputed leftmost derivations for terminal strings
of length $\le k$ is bounded by $O(|N||\Sigma|^k)$. Note that $G$ is an LL(k)-grammar.

For each reading and expansion, respectively, at most $k$ queues are considered.
Since the number of queues does not exceed $k$, the whole time per step for
the construction of queues is $O(k)$. For the decomposition of queues no more
time than for the construction is needed. Hence, the parsing time is bounded
by $O(k \cdot \text{length of the derivation})$. Altogether, we have proven the following
theorem.

Theorem 2. Let $G = (V, \Sigma, P, S)$ be an LL(k)-grammar. Then there is an
extended parser $P_G$ for $G$ which has the following properties.

i) The size of the parser is $O(k|N||\Sigma|^k)$.

ii) $P_G$ needs only the additional space for at most $k^2$ pointers.

iii) The parsing time is bounded by $O(k \cdot \text{length of the derivation})$.

Acknowledgment: I thank Claus Rick for critical remarks.



Abstract. LR parsers have long been known as an efficient
algorithm for recognizing deterministic context-free grammars. In this
article, we present a linear-time method for parsing substrings of LR
languages. The algorithm depends on the LR automaton that is used for
the usual parsing of complete sentences. We prove the correctness and
linear complexity of our algorithm and present an interesting extension
of our substring parser that allows one to condense the input string, which
increases the speed when reparsing that string for a second time.



1 Introduction 

The problem of recognizing substrings of context-free languages has emerged
in several interesting applications and can be described as follows. Given a
string $y$ and a grammar $G = (V, \Sigma, P, S)$, we wish to know whether there exist
two additional strings $x$ and $z$ such that $xyz$ is a sentence of $G$. An important
application for a corresponding substring parser is a method for detecting syntax
errors suggested by Richter [?], although his article does not contain such a
parser. The ability to decide whether a part of a given program is not a substring
of a programming language allows the local detection of syntax errors without
performing a complete parsing process.

Several substring parsers suffering from various drawbacks have already been
presented. Cormack’s algorithm [?] and a parallel version of it [?] only
work with the bounded-context class of grammars, which is a proper subset of the
LR(1) class. Bates and Lavie’s approach [?] is applicable for SLR(1), LALR(1)
and all canonical LR(k) grammars, but their correctness proof as well as their
complexity analysis are incorrect, as we shall show later.

In this article, we develop a substring parser that can be used with SLR(k),
LALR(k) and canonical LR(k) grammars. Moreover, our parser determines the
maximum prefix of an input string $y$ that represents a substring. We also present
another interesting feature called condensation of substrings. This feature allows
us to transform an input string $y$ into a string $\beta \in V^+$ such that the following two
conditions are satisfied:

$$\beta \Rightarrow_G^* y \quad\text{and}\quad \forall x, z:\ xyz \in L(G) \Rightarrow x\beta z \text{ is a sentential form of } G.$$

Thus, all reductions stored in a condensation string $\beta$ of an input string $y$ must
always be done when parsing a string that contains $y$ as a substring. This allows
the replacement of $y$ with $\beta$ and increases the processing speed when reparsing
the resulting string because the mentioned reductions are automatically skipped.

* This work is dedicated to my mother.

We begin in Sect. 2 with a review of the basic terminology and definitions used
throughout this paper. The algorithm is then explained in Sect. 3. In this first
version, the parser is only applicable to SLR(1), LALR(1) and LR(1) grammars.
The correctness of this algorithm is proven in Sect. 4, and Sect. 5 deals with
the analysis of its linear time complexity. Section 6 describes the condensation
feature, and Sect. 7 deals with the necessary modifications in order to use the
substring parser with canonical LR(k), SLR(k) and LALR(k) grammars, where
$k > 1$. An appendix pinpoints the already mentioned flaws in Bates and Lavie’s
work.



2 Terminology and Definitions 

In this section, the basic terminology and definitions used in this article are
introduced. We assume that the reader is familiar with the LR parsing technique.
For more information, the reader is directed to [?].

A context-free grammar is a quadruple $G = (V, \Sigma, P, S)$, where $V$ is the set
of grammar symbols called the vocabulary, $\Sigma \subset V$ is a set of terminal symbols,
$N := V \setminus \Sigma$ is the set of variables, $P$ is the set of productions (or rules), and
$S \in N$ is the start symbol. A production is of the form $A \to \alpha$, where $A \in N$ and
$\alpha \in V^*$. We use $A \to \gamma_1 \mid \ldots \mid \gamma_n$ to denote the productions $A \to \gamma_1, \ldots, A \to \gamma_n$.
We assume that $G$ is always unambiguous and reduced, i.e., $G$ does not contain
any unnecessary symbols.

Letters used in formulas have the following meaning. Upper and lower case
letters at the beginning of the alphabet denote variables and terminals,
respectively, whereas upper and lower case letters at the end of the alphabet are general
grammar symbols in $V$ and terminal strings in $\Sigma^*$, respectively. Greek lower
case letters denote vocabulary strings in $V^*$.

A substring of $G$ is a string $y$ such that there exist some $x, z$ with $xyz \in L(G)$,
where $L(G)$ denotes the language generated by $G$. We use $SS(G)$ and $S(G)$ to
denote the set of all substrings and the set of all sentential forms, respectively.

Let $k \in \mathbb{N}_0$. A quadruple $(A, \alpha, \beta, x)$, written
$[A \to \alpha \cdot \beta, \{x\}]$, is called an LR(k)-item of $G$ if
$A \to \alpha\beta \in P$ and $x \in \Sigma^{\le k}$, i.e., $x \in \Sigma^*$ and
$|x| \le k$. An SLR(k), LALR(k) or LR(k) parser always depends on several sets
containing these items because these sets make up the states of its
corresponding LR-DFA, which is a deterministic finite automaton that is used to
build up the usual functions Action and Goto. For example, the appropriate
LR-DFA for an SLR(1) grammar consisting of the rules

$$S \to Ab \mid aa \mid aAb, \qquad A \to aa$$

is given in Fig. 1. Its start state, generally denoted by $q_0$, is positioned in
the upper left corner. The emphasized states $q_2$, $q_6$ and $q_7$ are the final
states. They all contain an item of the form $S \to \alpha \cdot$, which indicates
that the parsing process is complete.




Fig. 1. The state transition diagram of the sample SLR(1) grammar
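
Since the figure itself is not reproduced here, the following Python sketch encodes the transitions that can be read off the configurations quoted in this paper. The names `DELTA`, `FINAL`, `RULES` and `spelling` are illustrative assumptions, and state `i` stands for $q_i$; the table is a reconstruction, not a copy of Fig. 1.

```python
# Hypothetical encoding of the LR-DFA of the sample grammar
# S -> Ab | aa | aAb, A -> aa (state i stands for q_i).
RULES = {"S": ["Ab", "aa", "aAb"], "A": ["aa"]}

# Transition function delta (the Goto function), reconstructed from the text.
DELTA = {
    (0, "a"): 1, (0, "A"): 4,
    (1, "a"): 2, (1, "A"): 5,
    (2, "a"): 3,
    (4, "b"): 6,
    (5, "b"): 7,
}
FINAL = {2, 6, 7}  # the emphasized states q2, q6 and q7


def spelling(state, start=0):
    """Return the symbols read on the path from `start` to `state`.

    The search terminates because this particular DFA is acyclic.
    """
    todo = [(start, "")]
    while todo:
        q, word = todo.pop()
        if q == state:
            return word
        for (src, symbol), dst in DELTA.items():
            if src == q:
                todo.append((dst, word + symbol))
    return None
```

Under these assumptions, every final state is reached by reading a complete right-hand side of an $S$-rule, which matches the remark that the final states contain an item $S \to \alpha \cdot$.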



We denote the set of all states with $Q$, and the transition function (which in
fact represents the function Goto) with $\delta$. We extend the domain of $\delta$
to sets $S \subseteq Q$ and strings in $V^*$ in the usual way by establishing

$$\delta(S, a) := \{\delta(q, a) \mid q \in S\}, \quad \delta(S, \varepsilon) = S \quad \text{and} \quad \delta(S, aw) = \delta(\delta(S, a), w).$$

All incoming edges of a state $q$ in an LR-DFA are labelled with a unique
symbol which we denote by $\varphi(q)$. The corresponding function
$\varphi : Q \to V$ can easily be extended to a function $\varphi$ that accepts
strings $s_1 \dots s_r$ in $Q^+$ by establishing
$\varphi(s_1 \dots s_r) := \varphi(s_2) \dots \varphi(s_r)$. Note that since $s_1$
is often equal to $q_0$ in the following applications of $\varphi$, the first state
$s_1$ is omitted in the definition of this function because $q_0$ does not have
any incoming edges.
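
The two extensions just defined can be sketched in Python as follows. The transition table is the reconstruction of the sample LR-DFA used for illustration (state `i` stands for $q_i$); the function names are assumptions, and undefined transitions are simply dropped when $\delta$ is applied to a set.

```python
# Reconstructed transition table of the sample LR-DFA (an assumption).
DELTA = {
    (0, "a"): 1, (0, "A"): 4,
    (1, "a"): 2, (1, "A"): 5,
    (2, "a"): 3,
    (4, "b"): 6,
    (5, "b"): 7,
}


def delta_set(states, word):
    """delta extended to a set of states and a string of grammar symbols."""
    current = set(states)
    for symbol in word:
        current = {DELTA[(q, symbol)] for q in current if (q, symbol) in DELTA}
    return current


def phi(state):
    """The unique symbol on all incoming edges of `state` (None for q0)."""
    symbols = {symbol for (_, symbol), dst in DELTA.items() if dst == state}
    return symbols.pop() if symbols else None


def phi_string(states):
    """phi(s_1 ... s_r) := phi(s_2) ... phi(s_r); the first state is skipped."""
    return "".join(phi(s) for s in states[1:])
```

For instance, `phi_string([0, 1, 5, 7])` spells the viable prefix read along the stack $q_0 q_1 q_5 q_7$.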

A configuration of an LR parser describes the status the parser is currently in.
More precisely, a configuration is a pair $(s_1 \dots s_r, a_i \dots a_n)$, where
$s_1, \dots, s_r$ are the states currently pushed on the stack (with $s_r$ on the
stack top), and $a_i \dots a_n$ is the rest of the input string $a_1 \dots a_n$ that
has not yet been read. Clearly, from the way an LR parser works, $s_1$ must be
equal to $q_0$, and, for $j = 1, \dots, r-1$,
$\delta(s_j, \varphi(s_{j+1})) = s_{j+1}$.

Only a part of a configuration is used to determine the next parsing step
because an LR(k) parser only looks at the next $k$ symbols
$a_i, \dots, a_{\min\{i+k-1,\,n\}}$. Furthermore, the function Action only
additionally needs the top state $s_r$ as a parameter. Thus, we can define a
partial configuration by dropping the condition $s_1 = q_0$ and only demanding
that the string part contains at least the next $k$ symbols. The stack part of
such a configuration is then called a partial stack.

Let $\kappa_1, \kappa_2$ be two (partial) configurations. We write
$\kappa_1 \vdash_{S} \kappa_2$ and $\kappa_1 \vdash_{A \to \alpha} \kappa_2$ to
denote that $\kappa_2$ results from $\kappa_1$ due to shifting the next symbol and
due to performing a reduction by the rule $A \to \alpha$, respectively. Moreover,
$\kappa_1 \vdash_{R} \kappa_2$ and $\kappa_1 \vdash \kappa_2$ mean that $\kappa_2$
results from $\kappa_1$ due to a reduction by some uninteresting rule and due to
any single parsing step, respectively. For example, the parsing process of the
string $aaab \in L(G)$, where $G$ is the grammar in the example given above, can
be described as follows:

$$(q_0, aaab) \vdash_S (q_0 q_1, aab) \vdash_S (q_0 q_1 q_2, ab) \vdash_S (q_0 q_1 q_2 q_3, b) \vdash_{A \to aa} (q_0 q_1 q_5, b) \vdash_S (q_0 q_1 q_5 q_7, \varepsilon).$$

The reflexive and transitive closure of $\vdash$ is denoted by $\vdash^*$. Note
that from the way an LR parser works, if $(q_0, vw) \vdash^* (s, w)$, then
$S \Rightarrow^*_{rm} \varphi(s)w \Rightarrow^*_{rm} vw$, where $\Rightarrow_{rm}$
denotes the usual rightmost derivation.
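
The sequence of configurations for $aaab$ can be replayed with a small table-driven LR parser. The sketch below is only an illustration: the tables are reconstructed from the text (state `i` stands for $q_i$), the end marker `"$"` and the treatment of a reduction by an $S$-rule as acceptance are assumptions, and `parse` is a hypothetical helper name.

```python
# Reconstructed tables of the sample grammar S -> Ab | aa | aAb, A -> aa.
DELTA = {
    (0, "a"): 1, (0, "A"): 4,
    (1, "a"): 2, (1, "A"): 5,
    (2, "a"): 3,
    (4, "b"): 6,
    (5, "b"): 7,
}
# ACTION[state][lookahead] is "shift" or a rule (lhs, rhs); "$" marks the end.
ACTION = {
    0: {"a": "shift"},
    1: {"a": "shift"},
    2: {"a": "shift", "b": ("A", "aa"), "$": ("S", "aa")},
    3: {"b": ("A", "aa")},
    4: {"b": "shift"},
    5: {"b": "shift"},
    6: {"$": ("S", "Ab")},
    7: {"$": ("S", "aAb")},
}


def parse(word):
    """Return all configurations (stack, rest) visited until acceptance."""
    stack, rest = [0], word
    trace = [(tuple(stack), rest)]
    while True:
        lookahead = rest[0] if rest else "$"
        action = ACTION[stack[-1]].get(lookahead)
        if action is None:
            raise ValueError("syntax error before %r" % rest)
        if action == "shift":
            stack.append(DELTA[(stack[-1], lookahead)])
            rest = rest[1:]
        else:
            lhs, rhs = action
            if lhs == "S":          # reducing to the start symbol: accept
                return trace
            del stack[len(stack) - len(rhs):]        # pop |rhs| states
            stack.append(DELTA[(stack[-1], lhs)])    # push Goto(top, lhs)
        trace.append((tuple(stack), rest))
```

Calling `parse("aaab")` yields the six configurations of the example in the same order.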

3 The Algorithm 

We now outline the idea behind our substring parser. In this first version,
it is suitable for any LR parser with a one-symbol lookahead (for example, an
SLR(1) parser).

Let us assume that $y$ is the input string used with our algorithm. The
substring parser simulates the behaviour of the LR parser when processing the
part $y$ of some input string $xyw$. The difference from a regular parsing
process is that in this case the configuration the LR parser is in after
processing $x$ is unknown. Moreover, there are usually many different
possibilities for $x$ and the corresponding configurations. Our algorithm gets
around this problem by managing several partial configurations at the same time
which on the one hand correspond to these complete configurations, and on the
other hand contain all the information that results from parsing the substring
$y$. Using this idea, we now analyse what the partial configurations must look
like at the beginning.

After processing the prefix $x$ of the complete input string $xyw$, the original
LR parser starts the parsing process of $y = az$ by shifting $a$. Clearly, just
before this step, the top state $q$ on the stack must satisfy the condition
$\mathrm{Action}(q, a) = \mathrm{Shift}$. Conversely, every state $q$ with this
property can in fact make up the topmost state because there always exists some
string $x$ and a path $p$ from $q_0$ to $q$ in the LR-DFA such that
$\varphi(p) \Rightarrow^* x$ and $(q_0, xaz) \vdash^* (p, az) \vdash_S (pq', z)$,
where $q' := \delta(q, a)$. For example, in Fig. 1, let $q := q_4$, and let $b$ be
the shifted symbol. Then we can choose $p := q_0 q_4$ and $x := aa$ because $x$ can
be derived from $\varphi(p) = A$. As desired, we obtain:

$$(q_0, aabz) \vdash_S (q_0 q_1, abz) \vdash_S (q_0 q_1 q_2, bz) \vdash_{A \to aa} (q_0 q_4, bz) \vdash_S (q_0 q_4 q_6, z).$$

Thus, our substring parser starts with all partial configurations of the form
$(q, az)$, where $q$ satisfies the above condition. Then, the algorithm alternately
simulates the shift and reduce operations of the LR parser. Clearly, at first a
shift operation must be simulated. Corresponding to the action an LR parser
performs in such a situation, this is simply done by pushing a new state
$\delta(q, a)$ on every partial stack, where $q$ denotes the respective topmost
state and $a$ denotes the next input symbol. The input pointer is then advanced
to the next symbol. Thus, during the first shift simulation, every partial
configuration $(q, az)$ is replaced with $(q\,\delta(q, a), z)$. Later shift
operations are handled in the same way.
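
As a hedged illustration of this starting step, the following Python sketch computes the initial partial stacks and simulates one shift. The tables are a reconstruction of the sample grammar's LR-DFA (state `i` stands for $q_i$, only terminal lookaheads are listed), and the helper names are assumptions.

```python
# Reconstructed tables of the sample grammar S -> Ab | aa | aAb, A -> aa.
DELTA = {
    (0, "a"): 1, (0, "A"): 4,
    (1, "a"): 2, (1, "A"): 5,
    (2, "a"): 3,
    (4, "b"): 6,
    (5, "b"): 7,
}
ACTION = {
    0: {"a": "shift"},
    1: {"a": "shift"},
    2: {"a": "shift", "b": ("A", "aa")},
    3: {"b": ("A", "aa")},
    4: {"b": "shift"},
    5: {"b": "shift"},
    6: {},
    7: {},
}


def initial_stacks(symbol):
    """One partial stack (q) for every state q with Action(q, symbol) = Shift."""
    return [(q,) for q in sorted(ACTION) if ACTION[q].get(symbol) == "shift"]


def simulate_shift(stacks, symbol):
    """Push delta(top, symbol) on every partial stack."""
    return [stack + (DELTA[(stack[-1], symbol)],) for stack in stacks]
```

With the first input symbol $b$, only $q_4$ and $q_5$ qualify, and the shift turns the stacks $(q_4)$ and $(q_5)$ into $(q_4 q_6)$ and $(q_5 q_7)$.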

Between two shift operations, the substring parser simulates the corresponding
reductions. Provided that the partial stack of a configuration is large enough,
this again can be done in the usual way. For example, using Fig. 1, consider
the configuration $(q_0 q_1 q_2, b)$. The LR parser then reduces the stack due to
the rule $A \to aa$, and so does our algorithm. The resulting configuration
afterwards is $(q_0 q_4, b)$. But when the partial stack does not contain enough
states, the algorithm has to take care of all possible extensions of it. For
example, let $(q_1 q_2, b)$ be the starting configuration. Now the reduction
$A \to aa$ cannot be handled directly because at least one state $q$ is missing at
the stack bottom. Clearly, $q_1$ must be accessible from $q$ because the partial
stack always corresponds to a path in the LR-DFA. Thus, $q$ must be equal to
$q_0$, and again $(q_0 q_4, b)$ is the resulting configuration. If $q_1$ had been
accessible from another state $q'$, then the substring parser would have
generated another configuration $(q'\,\delta(q', A), b)$. In general, when there
are $r$ states missing in the stack and $q$ is the bottom state, the algorithm has
to consider the states in $\delta^{-1}(q, r)$, where the function
$\delta^{-1} : Q \times \mathbb{N}_0 \to 2^Q$ is defined as follows:

$$\delta^{-1}(q, r) := \{q' \mid \exists \alpha \in V^r : \delta(q', \alpha) = q\}.$$

Then for every $q' \in \delta^{-1}(q, r)$, a new partial configuration with the
stack contents $q'\,\delta(q', A)$ is generated.
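
The function $\delta^{-1}$ and the reduction of a partial stack that is too short can be sketched as follows. This is a minimal illustration on the reconstructed transition table of the sample grammar (state `i` stands for $q_i$); `delta_inverse` and `reduce_partial` are hypothetical names.

```python
# Reconstructed transition table of the sample LR-DFA (an assumption).
DELTA = {
    (0, "a"): 1, (0, "A"): 4,
    (1, "a"): 2, (1, "A"): 5,
    (2, "a"): 3,
    (4, "b"): 6,
    (5, "b"): 7,
}


def delta_inverse(q, r):
    """All states from which q is reached by reading exactly r symbols."""
    current = {q}
    for _ in range(r):
        current = {src for (src, _), dst in DELTA.items() if dst in current}
    return current


def reduce_partial(stack, lhs, rhs_len):
    """Partial stacks obtained by reducing `stack` with a rule lhs -> rhs."""
    if len(stack) > rhs_len:
        # Enough states: pop |rhs| states, then push Goto(top, lhs).
        rest = stack[:len(stack) - rhs_len]
        return [rest + (DELTA[(rest[-1], lhs)],)]
    # Otherwise `missing` states are missing below the bottom state.
    missing = rhs_len - len(stack) + 1
    return [(q, DELTA[(q, lhs)])
            for q in sorted(delta_inverse(stack[0], missing))
            if (q, lhs) in DELTA]
```

Both the complete stack $q_0 q_1 q_2$ and the partial stack $q_1 q_2$ from the example lead to $q_0 q_4$ when reduced by $A \to aa$.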

We now discuss the management of the partial configurations. The substring
parser maintains a directed labelled graph $Gr = (V, E, l)$, where $V$, $E$ and
$l$ denote a set of vertices (or nodes), a set of edges, and a labelling function
$l : V \to Q$, respectively. The graph structure consists of several trees, and
the root nodes of these trees are collected in a set $T$. We are exclusively
interested in the maximum paths contained in the trees (i.e., paths from leaves
to root nodes), and therefore from now on, when speaking of a path, we always
mean a maximum one. Let $p$ be such a path in $Gr$ and let $|p|$ denote the length
of $p$, i.e., the number of edges in it. Let $s_i$ be the label of the $i$-th node,
$1 \le i \le |p| + 1$. Finally, let $y = y_1 y_2$, where $y_1$ is the prefix read so
far. Then $p$ represents all configurations $(s, y_2 z)$ with the following two
properties. Firstly, $(s, y_2 z)$ can be obtained by parsing the prefix $x y_1$ of
a sentence $x y_1 y_2 z$ with the original LR algorithm. Secondly, the labelling
of $p$, defined as $l(p) := s_1 \dots s_{|p|+1}$, is a suffix of $s$.

Conversely, every configuration with the first property is represented by some
path in the graph. Hence, $y_1$ is a substring of $L(G)$ iff the graph is not
empty. Furthermore, a path $p$ can be discarded from the graph if there exists
another path $p'$ such that $l(p')$ is a suffix of $l(p)$.
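
The discarding rule can be illustrated on plain tuples of state numbers. The sketch below assumes that a path labelling is given as a tuple ordered from the stack bottom to the stack top; `is_suffix` and `prune` are hypothetical helper names, and the quadratic comparison is chosen for clarity, not efficiency.

```python
def is_suffix(short, long):
    """True if the tuple `short` is a suffix of the tuple `long`."""
    return len(short) <= len(long) and long[len(long) - len(short):] == short


def prune(labellings):
    """Drop every labelling that has another labelling as a suffix."""
    return {p for p in labellings
            if not any(q != p and is_suffix(q, p) for q in labellings)}
```

A shorter labelling represents a superset of the configurations of any labelling it is a suffix of, so nothing is lost by keeping only the shorter one.
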

Two alternately called procedures maintain the graph. The first one,
ShiftCommonSymbol (SCS), changes the paths in the same manner as the LR
algorithm changes the stack portion of a configuration due to a shift operation,
i.e., a path that ends in a root node labelled with some state $q$ will be
extended by a new node labelled with the state $\delta(q, a)$, where $a$ denotes
the shifted symbol. The other procedure, ReduceStacks (RS), simulates reduce
operations in a similar way, i.e., for a reduction according to some grammar rule
$A \to \alpha$, at first the substring parser drops $|\alpha|$ nodes and edges from
the end of a path and then appends a new node labelled with $\delta(l(v), A)$,
where $v$ is the last node of the path after the first step. When a path $p$
contains fewer than $|\alpha|$ edges, it is necessary to replace $p$ by several new
short paths. Each of them consists of two nodes, and the label $q$ of the first
node is a state in $\delta^{-1}(l(v), |\alpha| - |p|)$, where $v$ denotes the first
node of $p$. As in the previous case, the state of the ending node is
$\delta(q, A)$. To avoid redundant work, each of these short paths can be
produced only once during one execution of RS.

The case Action = Error is simulated by simply deleting the corresponding 
path. The algorithm terminates when either the complete input string has been 
read or every path has been deleted. In the latter case, the part of the input 
string read so far is the longest prefix representing a substring. 

Figure 2 shows the development of the graph when parsing the substring ab. 
In this simple example, each tree always consists of only one single path. The 
vertices with index T are the root nodes. 




In order to obtain a linear-time complexity it is essential for the root nodes
of the trees to be labelled differently, and this is necessary for the children of
any node as well. Assume there are two trees with this property whose root
nodes $v_1$ and $v_2$ are labelled with the same state. As mentioned above, a
certain set of configurations is represented by these trees. The following
procedure then merges them into one tree such that the resulting tree has $v_1$
as its root node and represents the same set of configurations. The procedure
uses the fact mentioned earlier that a path $p$ can be removed from the graph if
there exists another path such that its labelling is a suffix of $l(p)$.

Procedure MergeTrees(v1, v2) {
  If v1 and v2 both have children Then {
    Disconnect all children from v2 and delete v2;
    For each former child c2 of v2 Do {
      If v1 has a child c1 such that l(c1) = l(c2) Then
        MergeTrees(c1, c2)
      Else Let c2 become a child of v1;
    }
  } Else Remove both trees completely except for the single node v1;
}
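
A direct Python transcription of MergeTrees may help to see the effect of the two branches. The `Node` class, the list-based child storage and the helper `labellings` are assumptions made for this sketch; the state numbers in the usage example are arbitrary.

```python
class Node:
    """A graph node labelled with a state; children lie deeper in the stack."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def merge_trees(v1, v2):
    """Merge the tree rooted in v2 into the tree rooted in v1 (same label)."""
    if v1.children and v2.children:
        former = v2.children
        v2.children = []                      # disconnect; v2 is dropped
        for c2 in former:
            c1 = next((c for c in v1.children if c.label == c2.label), None)
            if c1 is not None:
                merge_trees(c1, c2)
            else:
                v1.children.append(c2)
    else:
        # One root is a leaf, so its one-node labelling is a suffix of every
        # other path: only the single node v1 survives.
        v1.children = []


def labellings(root, suffix=()):
    """Labellings of all leaf-to-root paths, written bottom state first."""
    suffix = (root.label,) + suffix
    if not root.children:
        return {suffix}
    result = set()
    for child in root.children:
        result |= labellings(child, suffix)
    return result
```

For example, merging `Node(7, [Node(1, [Node(0)]), Node(2)])` with `Node(7, [Node(1, [Node(3)]), Node(4)])` leaves a single root labelled 7 whose child labelled 1 has the two children 0 and 3.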



As an example, let us assume we have an LR-DFA with at least nine states, 
and the two trees on the left side in Fig. 3 have been generated at some time. 
The result from merging them is then shown on the right side. 




Fig. 3. An example for merging two trees 



The given facts lead to the following algorithm. It determines the index $j$
of the maximum prefix $a_1 \dots a_j \in SS(G)$ of the input string
$a_1 \dots a_n \in \Sigma^+$. Some program lines are marked with a bar (|) on
their left sides. These lines correspond to the extension that condenses the
input string and are explained later in Section 6. For the moment, we simply
assume that these lines are not present.

 1  Procedure ShiftCommonSymbol (SCS) {
      M := ∅;
    | If |T| = 1 Then h := h + 1;
      For v ∈ T Do {   (* Extend all paths ending in v *)
 5      If no node w ∈ M is labelled with δ(l(v), a_i), create it and add it to M;
        Let v become a child of w;
      }
      T := M; i := i + 1;
    }   (* End of SCS *)
10
    Procedure ReduceStacks (RS) {
      M := ∅; For (q, q') ∈ Q × (Q \ {q_0}) Do Flag F_{q,q'} := False;
      While ∃ v ∈ T Do {   (* Process all paths ending in v *)
        T := T \ {v};
15      If Action(l(v), a_i) = Reduce(A → α) Then {   (* Reduce paths *)
    |     If h ≥ |α| And T ∪ M = ∅ Then {
    |       Replace the |α| symbols in front of a_i by the single symbol A;
    |       If i ≠ i' Then { i' := i; ClearStack(K); }
    |       Push(K, A → α); h := h - |α| + 1;
20  |     } Else h := 0;
          R := {v}; j := |α|;
          While j > 0 And R ≠ ∅ Do {   (* Shorten all paths by at most |α| *)
            T := T ∪ {(u, j) | u ∈ R is a leaf};
            R' := {u | u is a child of some node in R};
25          Remove all connecting edges between nodes in R and R';
            R := R'; j := j - 1;
          }
          For (w, j) ∈ T Do {   (* Generate new short paths *)
            For q ∈ δ⁻¹(l(w), j) Do If Not Flag F_{q,δ(q,A)} Then {
30            Flag F_{q,δ(q,A)} := True;
              Create a new node v labelled with q and add v to R;
            }
            Remove node w;
          }
35        For w ∈ R Do {   (* Add the new state to all paths *)
            Let w become the child of a new root node v labelled with δ(l(w), A);
            If ∃ v' ∈ T ∪ M : l(v') = δ(l(w), A) Then
              MergeTrees(v', v);
            Else T := T ∪ {v};
40        }
        } Else If Action(l(v), a_i) = Error Then {
          Remove the tree rooted in v completely from the graph;
        } Else (i.e., Action(l(v), a_i) = Shift) M := M ∪ {v};
      }
45    T := M;
    }   (* End of RS *)

    h := 0; ClearStack(K);   (* Beginning of Main Program *)
    i := 1; T := ∅;
    For each q ∈ Q with Action(q, a_1) = Shift Do
      Create a new node v labelled with q and add v to T;
    While i < n And T ≠ ∅ Do { SCS; RS; }
    If T = ∅ Then {
      While Not Empty(K) Do {
        Pop the top element A → α from K;
        Replace the single symbol A in front of a_i by α;
      }
      Return(i - 1);
    } Else Return(n);   (* End of Main Program *)
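
To make the interplay of shifting and reducing concrete, the following Python sketch recognizes the longest substring prefix for the sample grammar. It is deliberately naive: it keeps an explicit set of partial stacks instead of the tree-structured graph, performs no merging, and therefore does not achieve the linear time bound of the algorithm above. The tables are a reconstruction of the sample LR-DFA (state `i` stands for $q_i$; reductions are stored as a pair of left-hand side and right-hand-side length), and all function names are assumptions.

```python
SHIFT = "shift"
DELTA = {(0, "a"): 1, (0, "A"): 4, (1, "a"): 2, (1, "A"): 5,
         (2, "a"): 3, (4, "b"): 6, (5, "b"): 7}
# Only terminal lookaheads are needed: the lookahead is always an input symbol.
ACTION = {0: {"a": SHIFT}, 1: {"a": SHIFT},
          2: {"a": SHIFT, "b": ("A", 2)}, 3: {"b": ("A", 2)},
          4: {"b": SHIFT}, 5: {"b": SHIFT}, 6: {}, 7: {}}


def delta_inverse(q, r):
    """All states from which q is reached by reading exactly r symbols."""
    current = {q}
    for _ in range(r):
        current = {src for (src, _), dst in DELTA.items() if dst in current}
    return current


def reduce_stack(stack, lhs, rhs_len):
    """Partial stacks that result from reducing `stack` by lhs -> rhs."""
    if len(stack) > rhs_len:
        bottoms = [stack[:len(stack) - rhs_len]]
    else:  # states are missing at the bottom: try every possible extension
        missing = rhs_len - len(stack) + 1
        bottoms = [(q,) for q in delta_inverse(stack[0], missing)]
    return {b + (DELTA[(b[-1], lhs)],) for b in bottoms if (b[-1], lhs) in DELTA}


def reduce_closure(stacks, lookahead):
    """Apply reductions until every remaining stack can shift `lookahead`."""
    ready, todo, seen = set(), list(stacks), set(stacks)
    while todo:
        stack = todo.pop()
        action = ACTION[stack[-1]].get(lookahead)
        if action == SHIFT:
            ready.add(stack)
        elif action is not None:              # a reduction (lhs, rhs_len)
            for new in reduce_stack(stack, *action):
                if new not in seen:
                    seen.add(new)
                    todo.append(new)
        # action is None: error entry, the stack is dropped
    return ready


def longest_substring_prefix(word):
    """Length j of the longest prefix of `word` that is a substring of G."""
    if not word:
        return 0
    stacks = {(q,) for q in ACTION if ACTION[q].get(word[0]) == SHIFT}
    for j, symbol in enumerate(word):
        if j > 0:
            stacks = reduce_closure(stacks, symbol)
        if not stacks:
            return j
        stacks = {s + (DELTA[(s[-1], symbol)],) for s in stacks}
    return len(word)
```

For the input $ab$ the sketch passes through the stack sets $\{q_0, q_1, q_2\}$, $\{q_0 q_1, q_1 q_2, q_2 q_3\}$, $\{q_0 q_4, q_1 q_5\}$ and $\{q_0 q_4 q_6, q_1 q_5 q_7\}$, so the whole string is reported as a substring.
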

4 Correctness

We now prove the correctness of the algorithm. 

Lemma 1 The above assured properties of the graph (i.e., it consists of trees, 
each path consists of at least two nodes, the children of each node are differently 
labelled, and so are the root nodes contained in T ) always hold before and after 
executing SCS or RS. They also hold whenever line 13 is reached (i.e., the main 
loop of RS is about to be executed) except for the fact that the root nodes are 
then contained in $T \cup M$ and not in $T$ alone.

Proof. By an easy analysis of SCS and RS combined with a simple induction. □ 

For the next lemma, we need a definition that describes the history of a path
during the execution of the algorithm. More precisely, we inductively define an
$(m_1, \dots, m_k)$-path as follows, where $m_1, \dots, m_k \in \{S\} \cup P$. An
$(\varepsilon)$-path is a path generated during the For-loop of the main program.
An $(m_1, \dots, m_k, S)$-path is an $(m_1, \dots, m_k)$-path lengthened by one
node due to one execution of SCS. Finally, all paths resulting from an
$(m_1, \dots, m_k)$-path $p$ due to the simulation of a reduction corresponding to
a rule $A \to \alpha$ are $(m_1, \dots, m_k, A \to \alpha)$-paths. These paths are
generated (possibly among others) during one execution of the lines 21-40.
Clearly, this implies that the ending node $v$ of $p$ satisfies the condition in
line 15. Also, note that $m_1$ is always equal to $S$ because SCS is the first
procedure being called.

Lemma 2 Let $p$ be an $(m_1, \dots, m_r)$-path and assume SCS has been executed
$i$ times. Then there exists a partial stack $s$ such that

$$(s, a_1 \dots a_n) \vdash_{m_1} \dots \vdash_{m_r} (l(p), a_{i+1} \dots a_n).$$

Proof. If $r = 0$, then $i = 0$ and we can choose $s := l(p)$. For the induction
step $r \to r + 1$, let $p'$ be the $(m_1, \dots, m_r)$-path from which $p$ has been
constructed. From the induction hypothesis we know there exists a partial stack
$s'$ such that
$(s', a_1 \dots a_n) \vdash_{m_1} \dots \vdash_{m_r} (l(p'), a_{i'+1} \dots a_n)$,
where $i'$ is the number of executions of SCS before the creation of $p$.

We first assume $m_{r+1} = S$. Then $p$ has been created due to a call to SCS and
$p'$ existed before the execution of this procedure. Let $v$ be the ending node
of $p'$. Obviously, from the For-loop in the main program and from the
management of $M$ in RS, $\mathrm{Action}(l(v), a_{i'+1}) = \mathrm{Shift}$. Hence,

$$(l(p'), a_{i'+1} \dots a_n) \vdash_S (l(p')\,\delta(l(v), a_{i'+1}), a_{i'+2} \dots a_n).$$

Clearly, from the way SCS works, $l(p) = l(p')\,\delta(l(v), a_{i'+1})$. Since
$i = i' + 1$, we thus obtain our desired result by choosing $s := s'$.

We now consider the case $m_{r+1} = A \to \alpha$. Before continuing, we recall
the following property of an LR-DFA:

Lemma 3 Let $q \in Q$, $[A \to X_1 \dots X_i \cdot X_{i+1} \dots X_n, \{w\}] \in q$,
$j \le i$, and let $q' \in \delta^{-1}(q, j)$. Then
$[A \to X_1 \dots X_{i-j} \cdot X_{i-j+1} \dots X_n, \{w\}] \in q'$ and

$$\delta(q', X_{i-j+1} \dots X_i) = q.$$

Proof. Immediate from the definition of $\delta^{-1}$ and the construction of an
LR-DFA. □

Now let $j := |p'|$, and let $v_1, \dots, v_{j+1}$ be the nodes of $p'$. From the
definition of $p$, $v_{j+1}$ has fulfilled the condition in line 15, i.e.,
$\mathrm{Action}(l(v_{j+1}), a_{i'+1})$ is equal to $\mathrm{Reduce}(A \to \alpha)$
and therefore $[A \to \alpha \cdot, \{a_{i'+1}\}] \in l(v_{j+1})$. Assuming
$j \ge |\alpha|$, we know that only $p$ is generated from $p'$, labelled with

$$l(p) = l(v_1) \dots l(v_{j+1-|\alpha|})\,\delta(l(v_{j+1-|\alpha|}), A).$$

From $\mathrm{Action}(l(v_{j+1}), a_{i'+1}) = \mathrm{Reduce}(A \to \alpha)$ we also
know that

$$(l(p'), a_{i'+1} \dots a_n) = (l(v_1) \dots l(v_{j+1}), a_{i'+1} \dots a_n)$$
$$\vdash_{A \to \alpha} (l(v_1) \dots l(v_{j+1-|\alpha|})\,\delta(l(v_{j+1-|\alpha|}), A), a_{i'+1} \dots a_n) = (l(p), a_{i'+1} \dots a_n).$$

Since $i = i'$, we again simply choose $s := s'$. (In Lemma 5 we shall see the
importance of preserving $s'$.)

We now assume $j < |\alpha|$. Then $p$ is a short path and consists of two
nodes labelled with $q$ and $\delta(q, A)$, where $q$ is in
$\delta^{-1}(l(v_1), |\alpha| - j)$. Assume $\alpha = X_1 \dots X_{|\alpha|}$. Since
$l(v_1) \in \delta^{-1}(l(v_{j+1}), j)$ and
$[A \to \alpha \cdot, \{a_{i'+1}\}] \in l(v_{j+1})$, Lemma 3 shows that
$[A \to X_1 \dots X_{|\alpha|-j} \cdot X_{|\alpha|-j+1} \dots X_{|\alpha|}, \{a_{i'+1}\}]$
is contained in $l(v_1)$. Similarly, $q \in \delta^{-1}(l(v_1), |\alpha| - j)$
implies $[A \to \cdot\,\alpha, \{a_{i'+1}\}] \in q$. Let
$t_m := \delta(q, X_1 \dots X_{m-1})$, $m = 1, \dots, |\alpha| - j$. By applying
Lemma 3 again, $\delta(t_{|\alpha|-j}, X_{|\alpha|-j}) = l(v_1)$. Thus
$t_1 \dots t_{|\alpha|-j}\, l(v_1) \dots l(v_{j+1})$ is a partial stack. Hence, by
the induction hypothesis,

$$(t_1 \dots t_{|\alpha|-j}\, s', a_1 \dots a_n) \vdash_{m_1} \dots \vdash_{m_r} (t_1 \dots t_{|\alpha|-j}\, l(p'), a_{i'+1} \dots a_n).$$

(Note that $t_1 \dots t_{|\alpha|-j}\, s'$ is a partial stack as well because a
parsing step never changes the state at the stack bottom. Thus, the first state
of $s'$ is equal to $l(v_1)$.) Since
$\mathrm{Action}(l(v_{j+1}), a_{i'+1}) = \mathrm{Reduce}(A \to \alpha)$, we also have

$$(t_1 \dots t_{|\alpha|-j}\, l(p'), a_{i'+1} \dots a_n) \vdash_{A \to \alpha} (t_1\,\delta(t_1, A), a_{i'+1} \dots a_n).$$

But $t_1$ is equal to $q$ and thus $t_1\,\delta(t_1, A) = l(p)$. We therefore
choose $s$ to be equal to $t_1 \dots t_{|\alpha|-j}\, s'$. □

Lemma 4 Let $(s_1 \dots s_r, az)$ be some partial configuration which satisfies
the condition $\mathrm{Action}(s_r, a) = \mathrm{Shift}$. Then there exists a
string $x \in \Sigma^*$ and a stack $t$ such that
$(q_0, xaz) \vdash^* (t s_1 \dots s_r, az)$.

Proof. Let $t$ be a path in the LR-DFA from $q_0$ to some state
$q \in \delta^{-1}(s_1, 1)$ (if $s_1 = q_0$, then let $t := \varepsilon$). Then
$t s_1 \dots s_r$ is a path from $q_0$ to $s_r$, i.e., $\varphi(t s_1 \dots s_r)$ is
a viable prefix of the grammar. Since the transition $\delta(s_r, a)$ is defined,
we conclude that $\varphi(t s_1 \dots s_r)a$ is a viable prefix as well. Hence, it
is possible to choose two strings $x$ and $y$ such that
$S \Rightarrow^*_{rm} \varphi(t s_1 \dots s_r)ay \Rightarrow^*_{rm} xay$. From the
way an LR parser works, this implies
$(q_0, xay) \vdash^* (t s_1 \dots s_r, ay)$. Thus, since the LR parser only has a
one-symbol lookahead, $(q_0, xaz) \vdash^* (t s_1 \dots s_r, az)$. □



Lemma 5 The substring parser algorithm always terminates. 

Proof. Clearly, we only have to show that a call to RS always returns. Assuming
the opposite, the outer While-loop of RS obviously has to run forever. But
Lemma 1 shows that $|T|$ is bounded by $|Q|$ at the beginning, and with each pass
of the loop one node is removed from $T$ in line 14. Thus, since line 39 contains
the only statement in the loop that adds nodes to $T$, the part of RS that
simulates a reduction for all paths ending in some root node $v \in T$ (lines
21-40) must be executed infinitely many times. Since $|T| \le |Q|$, there is at
least one path from which infinitely many successor paths are generated.
Clearly, from the management of the flags the total number of short successor
paths is bounded by $|Q|^2 - |Q|$. Let $p^0$ be the first path such that all
further successor paths $p^1, p^2, \dots$ are not short paths. $p^0$ itself is an
$(m_1, \dots, m_r)$-path for some $m_1, \dots, m_r$. By Lemma 2 we have
$\exists s : (s, a_1 \dots a_n) \vdash_{m_1} \dots \vdash_{m_r} (l(p^0), a_{i+1} \dots a_n)$,
where $i$ is the number of previous calls to SCS. By reviewing the corresponding
part of the proof of Lemma 2 we then conclude that

$$\exists i \; \exists s \; \forall j : (s, a_1 \dots a_n) \vdash_{m_1} \dots \vdash_{m_r} (l(p^0), a_{i+1} \dots a_n) \underbrace{\vdash_R \dots \vdash_R}_{j \text{ times}} (l(p^j), a_{i+1} \dots a_n).$$

Since $m_1 = S$, we have $\mathrm{Action}(s_r, a_1) = \mathrm{Shift}$, where
$s = s_1 \dots s_r$. Applying Lemma 4 yields

$$\exists x \; \exists t : (q_0, x a_1 \dots a_n) \vdash^* (ts, a_1 \dots a_n) \vdash_{m_1} \dots \vdash_{m_r} (t\,l(p^0), a_{i+1} \dots a_n) \vdash_R \vdash_R \dots$$

Thus, with the input string $x a_1 \dots a_n$, the original LR parser would run
forever as well. This contradiction proves the lemma. □



Lemma 6 If $(q_0, x a_1 \dots a_n) \vdash^* (s_1 \dots s_j, a_{i+1} \dots a_n)$ for
some $x \in \Sigma^*$ and $i \in \{1, \dots, n-1\}$, then there occurs a path $p$
during the $i$-th execution of RS such that $p$ is labelled with a suffix of
$s_1 \dots s_j$. More precisely, $p$ is contained in the graph when reaching line
13 at some point of time.

Proof. Let $r$ denote the number of parsing steps after shifting $a_1$. If
$r = 0$, then $s_{j-1}$ must be a state with
$\mathrm{Action}(s_{j-1}, a_1) = \mathrm{Shift}$, and $s_j$ must be equal to
$\delta(s_{j-1}, a_1)$. Clearly, from the executed program code when reaching
line 13 for the first time, there exists a path with the labelling
$s_{j-1} s_j$, i.e., this path satisfies the claim. Concerning the induction step
$r \to r + 1$, we know that
$(q_0, x a_1 \dots a_n) \vdash^* (t_1 \dots t_l, a_{i'+1} \dots a_n) \vdash (s_1 \dots s_j, a_{i+1} \dots a_n)$,
and there exists a path $p'$ when reaching line 13 at some point of time during
the $i'$-th execution of RS, where $l(p') = t_f \dots t_l$ for some $f$. Let
$v \in T$ be the last node of $p'$. Lemma 5 implies that $v$ is eventually removed
from $T$ and thus, $p'$ is processed at this point of time. Let us assume that the
$(r+1)$-th parsing step is a reduction that corresponds to a rule
$A \to \alpha$. Then
$\mathrm{Action}(t_l, a_{i'+1}) = \mathrm{Reduce}(A \to \alpha)$,
$s_1 \dots s_j = t_1 \dots t_{l-|\alpha|}\,\delta(t_{l-|\alpha|}, A)$ and $i = i'$.

The first of these facts implies that if $|\alpha| \le l - f$, then $p'$ is
replaced by a new path $p$ labelled with
$t_f \dots t_{l-|\alpha|}\,\delta(t_{l-|\alpha|}, A)$, thus $p$ satisfies the claim.
Before reaching line 13 again, a call to MergeTrees possibly replaces $p$ by some
shorter path labelled with a suffix of $l(p)$, but then the claim still holds.
Now let us assume $|\alpha| > l - f$. Then $(v, |\alpha| - l + f)$ is contained in
$T$ when reaching line 28, where $v$ is the first node of $p'$. Since
$t_1 \dots t_l$ represents a path in the LR-DFA,
$t_{l-|\alpha|} \in \delta^{-1}(t_f, |\alpha| - l + f)$. Thus, from the management
of the flags either now a new short path $p$ with
$l(p) = t_{l-|\alpha|}\,\delta(t_{l-|\alpha|}, A)$ is generated, or such a path has
already been generated during the current execution of RS some time before. In
both cases, the claim again holds.

We now consider the case that the last parsing step is a shift operation.
Then $\mathrm{Action}(t_l, a_{i'+1}) = \mathrm{Shift}$,
$s_1 \dots s_j = t_1 \dots t_l\,\delta(t_l, a_{i'+1})$ and $i = i' + 1$. Clearly,
from the management of the set $M$, $v \in T$ when RS returns. Thus, SCS and RS
are both called again. Since SCS converts $p'$ into a path $p$ labelled with
$t_f \dots t_l\,\delta(t_l, a_{i'+1})$, the claim is again correct because $p$ is
still present when entering RS. This completes the proof of the induction
step. □



Theorem 1. The substring parser algorithm is correct. 

Proof. Let $i \in \{2, \dots, n\}$. We first show the following equivalence:

$$a_1 \dots a_i \in SS(G) \iff \text{there exists a path (i.e., } T \ne \emptyset \text{) after the } (i-1)\text{-th execution of RS.}$$

If. Let $p$ be an $(m_1, \dots, m_r)$-path after the $(i-1)$-th execution of RS.
By Lemma 2 we know that

$$\exists s_1 \dots s_r : (s_1 \dots s_r, a_1 \dots a_n) \vdash_{m_1} \dots \vdash_{m_r} (l(v_1) \dots l(v_t), a_i \dots a_n),$$

where $v_1, \dots, v_t$ are the nodes of $p$. Since $v_t \in T$ must satisfy the
condition $\mathrm{Action}(l(v_t), a_i) = \mathrm{Shift}$ when leaving RS, we
obtain

$$\exists x \; \exists t : (q_0, x a_1 \dots a_n) \vdash^* (t s_1 \dots s_r, a_1 \dots a_n) \vdash_{m_1} \dots \vdash_{m_r} (t\,l(p), a_i \dots a_n) \vdash_S \dots$$

by applying Lemma 4. Hence, the LR parser successfully reads the prefix
$x a_1 \dots a_i$. But an LR parser never shifts an erroneous symbol. Thus,
$x a_1 \dots a_i$ is the prefix of some sentence $x a_1 \dots a_i y$. Hence,
$a_1 \dots a_i \in SS(G)$.

Only If. Let $a_1 \dots a_i \in SS(G)$, i.e., there exist two strings $x$ and $y$
such that $x a_1 \dots a_i y \in L(G)$. Thus, there also exist two stacks $s, t$
such that

$$(q_0, x a_1 \dots a_i y) \vdash^* (t, a_i y) \vdash_S (s, y) \vdash^* \dots$$

We also have
$(q_0, x a_1 \dots a_n) \vdash^* (t, a_i \dots a_n) \vdash_S (s, a_{i+1} \dots a_n)$
because the LR parser only has a one-symbol lookahead. By Lemma 6 there occurs
a path $p$ during the $(i-1)$-th execution of RS such that $l(p)$ is a suffix of
$t$. In particular, the last node $v$ of $p$ is labelled with the last state of
$t$. Therefore, $v$ fulfills the condition
$\mathrm{Action}(l(v), a_i) = \mathrm{Shift}$ and thus, $p$ (or a suffix of it due
to some call to MergeTrees) is still present when returning from RS.

Clearly, by an easy analysis of the main program, the proven equivalence
shows that our substring parser works correctly. □

5 Complexity 

We now show that our algorithm runs in time linear in the length of the
calculated prefix of the input string.

Lemma 7 Let $C := \{\kappa \mid \exists \kappa', \kappa'' : \kappa' \vdash_S \kappa'' \vdash^* \kappa\}$
be the set of partial configurations that result from shifting at least one
symbol. Then there exists a constant $K_1$ such that the LR parser, starting
with any partial configuration contained in $C$, never performs $K_1$ consecutive
reductions without decreasing the initial number of elements on the stack.

Proof. Let $\kappa = (s, y) \in C$. From the definition of $C$, there exist
$\kappa' = (t', ay)$ and $\kappa'' = (t'', y)$ such that
$\kappa' \vdash_S \kappa'' \vdash^* \kappa$. Moreover, when starting with
$\kappa$, let $\tau(\kappa)$ denote the number of reductions performed by the LR
parser until either the next symbol is shifted, the stack would contain fewer
than $|s|$ states after the next reduction, or an error occurs. $\tau(\kappa)$
must be a finite number because otherwise, by Lemma 4,

$$\exists x \; \exists t : (q_0, xay) \vdash^* (tt', ay) \vdash_S (tt'', y) \vdash^* (ts, y) \vdash_R \vdash_R \dots$$

Hence, in contrast to its properties, the LR parser would never terminate when
parsing the string $xay$.

Now let $s = s_1 \dots s_r$. Since a reduction corresponding to a rule
$A \to \alpha$ at first removes $|\alpha|$ states from the stack and then pushes a
new state on its top, none of the $\tau(\kappa)$ reductions discards any of the
states $s_1, \dots, s_{r-1}$ during its removal phase because otherwise, the
resulting stack would be shorter than the initial one. In particular,
$s_1, \dots, s_{r-2}$ are never used for determining the pushed states. Thus,
only $s_{r-1}$ and $s_r$ have an influence on $\tau(\kappa)$. Furthermore, except
for the first symbol of $y$, $\tau(\kappa)$ is not affected by $y$ either because
the LR parser only has a one-symbol lookahead. Hence, we can choose

$$K_1 := \max\{\tau(\kappa) \mid \kappa \in C \cap ((Q \cup Q^2) \times \Sigma^{\le 1})\} + 1.$$

The set $(Q \cup Q^2) \times \Sigma^{\le 1}$ is finite, and thus, so is $K_1$. □

Lemma 8 Let $I := V \setminus (T \cup M)$ denote the set of internal nodes when
reaching the start of the outer While-loop of RS (line 13) at some point of
time. Then at least one node is removed from $I$ after at most
$K_2 := |Q| \cdot K_1 + 1$ additional executions of this loop.




Proof. Let us first assume that a complete tree is deleted (line 41) during the
next $K_2$ executions. This implies that at least one path is completely removed
from the graph. Since the first node of a path is always internal, the lemma is
proven in this case.

By Lemma 1, $|T| \le |Q|$ and $|M| \le |Q|$ and thus, the condition in line 43
cannot be satisfied more than $|Q|$ times. Hence, at least
$K_2 - |Q| = |Q| \cdot (K_1 - 1) + 1$ reduction simulations are performed. From
the bound on $T$, there is some path $p$ with at least $K_1$ such simulations.
Thus, by Lemma 7, $p$ must have been shortened in the meantime, and this implies
that at least one node $v \in I$ must have been deleted. □

Lemma 9 The difference in the number of internal nodes before and after exe- 
cuting RS is bounded by a constant. 

Proof. Normally, the simulation of a reduction removes more nodes than it
generates. The only reductions that increase the length of a path by one node
are those corresponding to $\varepsilon$-rules, e.g. $A \to \varepsilon$. Thus, at
most $K_1 - 1$ such nodes can be found at the end of each path because, as
already seen, a path is getting shorter after at most $K_1$ reductions. From the
structure of the graph when leaving RS, the ending nodes of all paths are
contained in $T$, the children of the ending nodes are children of the nodes in
$T$, and so on. By Lemma 1, not only $T$, but also any set of children cannot
contain more than $|Q|$ nodes. Thus the number of the above mentioned nodes is
bounded by $|Q| + |Q|^2 + \dots + |Q|^{K_1 - 1}$. The at most $|Q|$ nodes in $T$
are not internal ones, but the former root nodes in $T$ before executing RS may
now be internal, thus the given number is also a bound for the internal nodes.
Hence, from the fact that the generation of the short paths additionally
produces at most $2(|Q|^2 - |Q|)$ nodes, the lemma is proven. □

Theorem 2. Let $j := \max\{i \mid a_1 \dots a_i \in SS(G)\}$. Then the time
consumed by the algorithm is bounded by $O(j)$. In particular, the algorithm
takes at most $O(n)$ time.

Proof. SCS and RS are called $j - 1$ times. Clearly, a call to SCS increases
the number of internal nodes by at most $|Q|$. Together with Lemma 9 this
implies a total $O(j)$ bound on the number of new internal nodes after returning
from SCS and RS. Thus, by Lemma 8 the number of executions of the outer
While-loop of RS is also bounded by $O(j)$. Recalling the $|Q|$ bound on $|T|$ and
on the number of children of any node, it is easy to see that one instance of
this loop can be executed in constant time, ignoring the costs of deleting and
merging trees (lines 38 and 42). Clearly, a call to SCS only takes some constant
time as well. Thus, there is an $O(j)$ bound on the total time consumed by the
algorithm (still ignoring lines 38 and 42) and therefore, we also have a linear
bound on the total number of all generated nodes. This implies that the deletion
of all trees can be done in $O(j)$ time, and moreover, the number of single calls
to MergeTrees$(v_1, v_2)$ cannot exceed $O(j)$ because such a call removes at
least the node $v_2$. This completes the proof of the linear time bound of our
algorithm. □

6 Condensation of Substrings

During the parsing process, an LR parser converts an input word $w \in L(G)$
step by step into the sentential forms that occur when generating the rightmost
derivation of $w$. More precisely, if $(q_0, xy) \vdash^* (s, y)$, then $\varphi(s)y$ is such a
sentential form, and it is possible to derive $x$ from $\varphi(s)$. Similarly, the substring
parser is able to convert the already read prefix $x$ of an input string $w$ into a
corresponding partial sentential form $\beta$, i.e., $\beta \Rightarrow^* x$, and there exist some
$\alpha, \gamma$ such that $\alpha\beta\gamma \in S(G)$. This condensation of $x$ is advantageous if $x$ must
be parsed again later. For example, if $x = abcde$ and there are two productions
$A \to bc$ and $B \to Ad$ which must always be applied to derive $x$, then the partial
sentential form $aBe$ can be processed more quickly than $abcde$ because two
reductions are already done. Thus, it is useful to replace a substring with its
condensation whenever possible. Of course, this can only be done if the “stored”
reductions are always applied to the substring. We therefore restrict the
conversion of $x$ into $\beta$ by the condition $\forall v, z\colon vxz \in L(G) \Rightarrow v\beta z \in S(G)$. Thus, we
do not want a partial sentential form to depend on some special context strings
$v$ and $z$.
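
The example above can be replayed mechanically. The following is only an illustrative sketch, not part of the substring parser: the function name `condense` and the representation of productions as (left-hand side, right-hand side) pairs are assumptions made for this example. It applies the two stored reductions to $abcde$ and arrives at $aBe$.

```python
def condense(symbols, reductions):
    """Apply each stored reduction once, in order, to the first match of its
    right-hand side. Hypothetical helper for illustration only."""
    current = list(symbols)
    for lhs, rhs in reductions:
        rhs = list(rhs)
        for start in range(len(current) - len(rhs) + 1):
            if current[start:start + len(rhs)] == rhs:
                # Replace the right-hand side by the grammar variable.
                current[start:start + len(rhs)] = [lhs]
                break
        else:
            raise ValueError(f"right-hand side {rhs} not found in {current}")
    return current


# Productions A -> bc and B -> Ad from the text, applied to x = abcde.
stored = [("A", "bc"), ("B", "Ad")]
condensed = condense("abcde", stored)
print("".join(condensed))
```

The sketch ignores the context condition on $v$ and $z$; it only shows why the condensed form is shorter to reparse.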

The condensation feature is realized by the program lines with a bar ( | ) on
their left sides. As before, the algorithm returns an index $j$ corresponding to
the calculated prefix $a_1 \ldots a_j$, but moreover, this prefix is now replaced by an
appropriate condensation of it. The idea which supports the additional lines is
as follows. Let $P^j(G) := \{v \mid \exists z\colon v a_1 \ldots a_j z \in L(G)\}$, and let us assume that
the following C-condition holds for some indices $i, j$ with $1 < i \le j \le n$:

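$$\exists s_1, \ldots, s_r \in Q \;\; \forall v \in P^j(G) \;\; \exists t \in Q^* \;\; \exists k_1, \ldots, k_l \in (\{ts_1\}Q^+) \times \Sigma^*:$$

$$(q_0, v a_1 \cdots a_n) \vdash^* (ts_1, a_i \ldots a_n) \vdash k_1 \vdash \ldots \vdash k_l \vdash (ts_1 \ldots s_r, a_j \ldots a_n).$$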

Since $k_1, \ldots, k_l \in (\{ts_1\}Q^+) \times \Sigma^*$, it can be easily seen that beginning with
$(ts_1, a_i \ldots a_n)$, the state $s_1$ is never removed from the stack. Therefore these
transitions only depend on the common state $s_1$ and not on $t$. Thus, from the
way an LR parser works, when parsing a string $v a_1 \ldots a_j z \in L(G)$, the
substring $a_i \ldots a_{j-1}$ is always reduced to $\varphi(s_1 \ldots s_r)$. (Recall that the parser
cannot look beyond the symbol $a_j$ because of its one-symbol lookahead.) Hence,
$v a_1 \ldots a_{i-1} \varphi(s_1 \ldots s_r) a_j z \in S(G)$, and therefore $a_1 \ldots a_{i-1} \varphi(s_1 \ldots s_r) a_j$ is a
condensation of $a_1 \ldots a_j$.

The condensation algorithm detects whether the above C-condition holds, 
and, if this is the case, changes the input string appropriately by replacing 
$a_i \ldots a_{j-1}$ with $\varphi(s_1 \ldots s_r)$. This replacement is done step by step by applying
the corresponding reductions to the input list. The following two lemmas show 
how the detection is managed. The first one implies a possibility to test whether 
the C-condition is true and is a refinement of Lemma El 

Lemma 10 Let $i \in \{2, \ldots, n\}$, and let $v \in \Sigma^*$ and $y = a_1 \ldots a_n$ be two
strings such that

$$(q_0, v a_1 \cdots a_n) \vdash^* (t, a_{i-1} \ldots a_n) \vdash k_1 \vdash \ldots \vdash k_l \vdash (t', a_{i+1} \ldots a_n),$$

where $k_j = (s^j, a_i \ldots a_n)$, for $j \in \{1, \ldots, l\}$. Then there exists a path $p$ in the
graph and an index $j \in \{1, \ldots, l\}$ whenever reaching line 13 during the $(i-1)$-th
execution of RS such that $l(p)$ is a suffix of $s^j$.

Proof. Let $f$ denote the total number of executions of the outer While-loop of
RS, and for $b \in \{1, \ldots, f+1\}$, let $G_b$ be the graph and $p_b$ the corresponding path
when reaching line 13 for the $b$-th time. We prove the lemma by induction
on $b$. Clearly, $G_1$ only consists of paths that result from shifting $a_{i-1}$. From
the proof of correctness, we know that the lemma holds in this case with $j = 1$.
Now let us consider the induction step $b \to b+1$. By the induction hypothesis,
there exists a path $p_b$ and an index $j_b$ satisfying the lemma. Let $v$ be the ending
node of $p_b$. During the next execution of the outer While-loop, possibly $p_b$ is
not considered. Then we can choose $p_{b+1} := p_b$ and $j_{b+1} := j_b$. If $p_b$ is replaced
by another path due to the simulation of a reduction, then we can choose this
path as $p_{b+1}$ and $j_{b+1} := j_b + 1$ because $l(p_{b+1})$ must be a suffix of $s^{j_b+1}$. Note
that in both cases, a call to MergeTrees possibly deletes $p_{b+1}$, but then there
must be a path that is labelled with a suffix of $l(p_{b+1})$, and thus we can then
choose this shorter path as $p_{b+1}$.

Finally, we consider the case that $p_b$ is completely removed from the graph
and there does not exist a suitable path for satisfying the lemma afterwards. This
cannot result from an execution of line 42 because $Action(l(v), a_i)$ is either Shift
or Reduce. Thus, this case is only possible if the generation of an appropriate
new short path $p_{b+1}$ fails due to the violation of the condition in line 29, where
the two states of $p_{b+1}$ match the last two states of $s^{j_b+1}$. But then $p_{b+1}$ must
have been created some time before, i.e., there exists some $c \in \{1, \ldots, b\}$ such
that $p_{b+1}$ existed in the graph before executing the main loop for the $c$-th time.
Now we can restart the induction, beginning with $b = c$. Since this case is only
possible at most $|Q|^2 - |Q|$ times, we finally succeed in proving the existence
of $p_{b+1}$ and $j_{b+1}$. □

Let us now assume that SCS has been called $i - 1$ times, and there only exists
one root node $w$ in the graph when reaching line 13 at some point of time. Then,
by Lemma E3 for every $v \in P^i(G)$ there is some path $p$ in the graph such that
$\exists t' \in Q^*\colon (q_0, v a_1 \ldots a_i) \vdash^* (t' l(p), a_i)$. (Note that $a_i$ will be shifted some time
later because $v a_1 \ldots a_i$ is a prefix of a sentence of $G$ and the parser only has a
one-symbol lookahead.) But every path in the graph ends with $s_1 := l(w)$, and
thus we know that the following special case of the C-condition is true whenever
there is only one root node:

$$\exists s_1 \in Q \;\; \forall v \in P^i(G) \;\; \exists t \in Q^* : (q_0, v a_1 \cdots a_i) \vdash^* (ts_1, a_i).$$

From the discussion at the beginning of this section, it is easy to see that the
C-condition remains valid as long as the common state $s_1$ is not popped off
the stack, i.e., as long as the node $w$ is not removed from the graph. From the
way the graph is maintained, it then follows that all paths will always end in a
common part that starts with $w$, and only this part is altered during this time.
We now show that the length of this common path is always contained in the
variable $h$.
Lemma 11 When there is only one root node v in the graph, the variable h
contains the length of the common suffix of all paths. If there is more than one 
root node, then h is equal to zero. 

Proof. h must be set to zero (line 48) while initializing the graph (lines 50-51)
because the graph only consists of single nodes. When calling SCS, h must 
be increased by one in the case that there is only one root node because this 
procedure then appends a new node to it. Otherwise, h must remain zero even 
if there is only one root node afterwards (line 3). Now we consider a call to 
RS. Clearly, h must be changed during an execution of the outer While-loop 
only if the condition in line 15 is true. Since a reduction corresponding to a rule
$A \to \alpha$ first removes $|\alpha|$ nodes from a path and then creates a new one, $h$ must
be decreased by $|\alpha| - 1$ (line 19), but only if $|\alpha| < h$ (line 16). Otherwise, there
does not exist a common path any more after the removal of the $|\alpha|$ nodes, and
thus $h$ must be reset to zero (line 20). Note that the condition $T \cup M = \emptyset$ in line
16, which is used to test whether there is exactly one root node in the graph, is
correct because one node has been removed from $T$ before (see line 14). □

We now explain some more of the remaining lines of our condensation al- 
gorithm. We already know that as long as there is only one root node in the 
graph, all reductions that are made in the meantime are common to all paths 
and thus they can be applied to the input string as well. For one single rule
$A \to \alpha$, the corresponding modifications of the input string are done in line 17.
In order to implement these modifications efficiently, the input string should be
stored as a doubly linked list.

As we have mentioned earlier, the common reductions that are applied to 
the input string lead to a valid condensation, but only if the next input symbol 
is shifted some time later. When using a canonical LR parser, we do not have 
to care about this problem because such a parser is known to never perform a 
reduction if the next symbol causes an error. But non-canonical parsers (e.g. 
SLR parsers) may still perform some wrong reductions before detecting the error 
situation. Since the substring parser has no possibility to detect in advance 
whether the next symbol can be shifted or not, it stores every reduction from 
the last shift operation on in a stack K such that these reductions can be undone 
when necessary. More precisely, the stack K is maintained in the following way. 
At the beginning, K is cleared in line 48. Then all reductions that affect the 
input list are pushed onto K in line 19. Since all these reductions are definitely 
correct when the next symbol is shifted, they never need to be undone in this case 
and thus, they can be discarded from K . This is guaranteed by the management 
of the additional index i' in line 18. The additional While-loop in lines 54-57 
restores the part of the list that was changed during the last reductions. Clearly, 
this is only necessary if the condition in line 53 is true because otherwise the 
complete input string has been successfully shifted and therefore all reductions 
are correct. It is easy to see that the loop pops one reduction after another 
from K and restores the list step by step. Therefore, the finally calculated 
condensation is always correct. Note that the rest of the algorithm is not affected 
by the condensation part because it has no influence over the graph structure.
Furthermore, from the time complexity of the original algorithm, we know that
the total number of rules pushed onto $K$ cannot exceed $O(j)$, where $a_1 \ldots a_j$ is
the calculated prefix of the input string. The time consumed by the While-loop
in lines 54-57 is therefore bounded by $O(j)$ as well. Since the additional lines
16-20 only take constant time, the total $O(j)$ time bound of the algorithm
is still valid.

When implementing the program, recall that the stack K is not necessary if 
the substring parser uses a canonical LR(1)-DFA. 
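
The bookkeeping around the stack $K$ can be pictured with a small sketch. It is not the authors' program: the class `CondensingInput` and its method names are hypothetical, and a Python list stands in for the doubly linked list recommended above, so the splice operations here are not constant-time. Reductions applied to the input are logged, the log is discarded once the next symbol has been shifted, and it is replayed backwards if the next symbol turns out to be erroneous.

```python
class CondensingInput:
    """Input string with an undo log of reductions (the role of stack K).
    Hypothetical sketch; a list replaces the doubly linked list for brevity."""

    def __init__(self, symbols):
        self.symbols = list(symbols)
        self.undo_stack = []  # entries: (position, removed right-hand side)

    def reduce(self, start, lhs, rhs_len):
        """Replace rhs_len symbols at `start` by the variable `lhs`."""
        removed = self.symbols[start:start + rhs_len]
        self.symbols[start:start + rhs_len] = [lhs]
        self.undo_stack.append((start, removed))

    def commit(self):
        """The next symbol was shifted: logged reductions are now final."""
        self.undo_stack.clear()

    def rollback(self):
        """The next symbol caused an error: undo reductions, newest first."""
        while self.undo_stack:
            start, removed = self.undo_stack.pop()
            self.symbols[start:start + 1] = removed


buf = CondensingInput("abcde")
buf.reduce(1, "A", 2)   # A -> bc gives a A d e
buf.commit()            # the following symbol was shifted successfully
buf.reduce(1, "B", 2)   # B -> Ad gives a B e
buf.rollback()          # the following symbol was an error: back to a A d e
print("".join(buf.symbols))
```

With a canonical LR(1)-DFA the `rollback` path is never needed, which is why the stack can be dropped in that case.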

7 Recognizing Substrings of LR(k) Languages

In this section, we first show that with some minor modifications, our substring
parser can even be used with canonical LR($k$) languages, where $k > 1$. Later, we
also discuss the case of non-canonical LR($k$) languages. Canonical LR($k$) languages
are easier to handle because their parsers have the valid-prefix property,
i.e., after parsing the prefix $x$ of some input string $xy$ and assuming that $q$ is the
current state on the stack top, $Action(q, y_k) \neq Error$ iff $x y_k$ is the prefix of some
sentence in $L(G)$, where $y_k$ denotes the first $k$ symbols of $y$. In the context of
our substring recognition problem, this means that when our algorithm returns,
the next $k - 1$ unshifted symbols also belong to the longest prefix in $SS(G)$.
(The next $k$ unshifted symbols do not belong to it because they were responsible
for the failure of the Action function.)

The modified algorithm works as follows. Let $w = a_1 \ldots a_n$ be the input
string. At first, the maximum prefix $x$ of $a_1 \ldots a_{k-1}$ is determined such that
$x \in SS(G)$. If $|x| < k - 1$ (or $|w| < k$), then clearly there is nothing left to do.
Otherwise, we start our previous algorithm with three modifications. Firstly,
wherever the function Action is used, the next $k$ symbols must be used for the
lookahead string. Secondly, the condition $i < n$ in line 52 must be replaced by
$i < n - k + 1$. And finally, the returned index must be increased by $k - 1$. This
leads to the following algorithm.

j := 1;

While j < k And j ≤ n And $a_1 \ldots a_j \in SS(G) \cap \Sigma^j$ Do j := j + 1;

If j = k And n ≥ k Then {

Execute the previous algorithm with the above mentioned modifications;

Return(j + k − 1); (* where j is returned by the original algorithm *)

} Else Return(j − 1);
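
The control flow of this wrapper can be sketched in Python. The two callables are assumptions introduced only for this illustration: `is_short_substring(u)` stands for the precomputed membership test $u \in SS(G) \cap \Sigma^{|u|}$ for $|u| < k$, and `run_modified_parser(word)` stands for the previous algorithm with the three modifications, returning its index $j$. Neither is defined by the text.

```python
def longest_substring_prefix_lrk(word, k, is_short_substring, run_modified_parser):
    """Wrapper for the LR(k) case, following the pseudocode above.
    Both callables are hypothetical stand-ins supplied by the caller."""
    n = len(word)
    j = 1
    # Check the prefixes a_1..a_j for j = 1, ..., k-1 against the short sets.
    while j < k and j <= n and is_short_substring(word[:j]):
        j += 1
    if j == k and n >= k:
        # All short prefixes are substrings: hand over to the modified parser
        # and add the k-1 lookahead symbols that it has not shifted.
        return run_modified_parser(word) + k - 1
    return j - 1


# Toy stand-ins, chosen only to exercise the control flow.
no_x = lambda u: "x" not in u
fake_parser = lambda w: 2
print(longest_substring_prefix_lrk("ax", 3, no_x, fake_parser))
```

In the toy call the second prefix already fails the short test, so the parser is never started and the length of the last good prefix is returned.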



The correctness and the linear complexity can be proven in nearly the same
way as before. Note that the sets $SS(G) \cap \Sigma^i$, for $i \in \{1, \ldots, k-1\}$, can be
precalculated from the canonical LR($k$)-DFA $(Q, \Gamma, \delta, q_0, F)$ of $G$ by using the
equation

$$SS(G) \cap \Sigma^i := \{ w \in \Sigma^i \mid \exists q \in Q \;\, \exists [A \to \alpha \cdot \beta, \{x\}] \in q \;\, \exists y, z \in \Sigma^* : x = ywz \}.$$

The new algorithm is not correct when using LR-DFAs of non-canonical LR($k$)
parsers, e.g. SLR($k$) parsers, because these parsers do not have the valid-prefix
property. In fact, after the modified original algorithm returns with an index $j$,
we only know that $a_1 \ldots a_j \in SS(G)$ and $a_1 \ldots a_{j+k} \notin SS(G)$. However, we are
able to present a solution to this problem even in this case.

Clearly, for every w S SS{G) there exists a string y £ such that 

either y G and still wy G SS{G), or y G and wy is a suffix of a 

sentence xwy G L{G). The idea now is to apply the modified algorithm to the 
new input string w' := wy. Note that if y G , then the algorithm accepts at 
least the prefix w because the lookahead of the substring parser always contains 
a substring of wy as long as the last symbol of w has not yet been shifted, and 
this will never happen due to the modified condition $i < n - k + 1$ in line 52.
In the other case, i.e., y G the substring parser completely accepts wy 

because otherwise the corresponding original LR parser would refuse to accept 
sentences of G that end with wy. Thus the returned pointer j corresponds to 
the last symbol of $w$ or to one symbol of $y$ iff $w \in SS(G)$. Unfortunately, since
y is unknown, we in general have to perform this test for all strings y G 

By using this method, we can check whether $w := a_1 \ldots a_{j + \lfloor k/2 \rfloor}$ is a substring
or not. Clearly, with additional $O(\log k)$ interval halvings we can then
determine the exact solution. Note that while the complexity of this algorithm
is still 0{j), there are possibly 0(2^ log /c) executions of the original algorithm, 
and thus this method does not seem to be practical if k is not small. 

8 Conclusion 

We have presented a linear-time bounded algorithm for recognizing and condensing
substrings of LR($k$) languages. Practical experience has shown that this
substring parser is nearly as fast as the corresponding normal LR parser. The 
substring parser has primarily been developed in order to generate a new algo- 
rithm for syntax error correction and recovery. This algorithm depends on the 
ideas of Richter jS| and divides an incorrect program into several parts such that 
on one hand each part contains at least one syntax error, but on the other hand 
a shorter substring of any part does not contain any syntax errors. This is easily 
done by firstly determining the longest error-free prefix of the program, and 
then secondly using the substring parser on the rest of the program to calculate 
the next part. The second step is then repeated until the complete program is 
analysed. Usually, the syntax errors can be found at the borders of the parts, 
and contrary to Richter’s opinion, numerous tests with sample programs that 
contained many different errors have shown that it is possible to obtain very good 
corrections by using the substring parser on three or more successive parts, where 
one part contains a trial correction and the other ones supply some context in- 
formation. The length of the determined prefix then represents the quality of 
the tested correction. By condensing the different parts with the extension presented
in Section 6, it is even possible to give the programmer an overview of
the structure of his program. For example, running the substring parser with a 
standard PASCAL grammar and the input string 

If i = 1 Then i := i + 1; Else i := 0;

generates the following output:

If BooleanCondition Then Statement ; 

Since this is the longest prefix of the input string that represents a substring 
of some correct PASCAL program, the programmer knows that an error occurs 
when appending the keyword Else (a semicolon followed by Else is incorrect). 
Also, the condition and the statement which are both meaningless in this special 
form are replaced by their more abstract grammar variables. 

The resulting algorithm is fast and has several advantages over other correc- 
tion methods, e.g. the advantage of never detecting spurious errors. Details will 
be published elsewhere. 

A The Problems in the Substring Parser of Bates and 
Lavie 

In [2], Bates and Lavie also present a substring parser for LR grammars. But
unfortunately, both the correctness and the linear complexity are proven
incorrectly, as we shall now demonstrate (familiarity with [2] is assumed).

We again use the SLR(1) grammar given in the second section as an example.
Let $c = (q_3, b)$ be a partial configuration. Since there is only one path from $q_0$
to $q_3$ in Fig. 1, there exists as well only one configuration such that $c$ is an inner
part of it. This configuration is $(q_0 q_1 q_2 q_3, b)$ and results from shifting the symbol
“a” three times:

$$(q_0, aaab) \vdash (q_0 q_1, aab) \vdash (q_0 q_1 q_2, ab) \vdash (q_0 q_1 q_2 q_3, b).$$

Clearly, the next configuration then results from a reduction corresponding to
the rule $A \to aa$:

$$(q_0 q_1 q_2 q_3, b) \vdash (q_0 q_1 q_5, b).$$

By using the notions introduced in [2], these facts can be written as

$$c = ([q_3], b, 1), \qquad M(c) = ([q_0, q_1, q_2, q_3], aaab, 4),$$
$$next(M(c)) = \{([q_0, q_1, q_5], aaab, 4)\}.$$

Now we determine the set $next(c)$. Clearly, LONG($A$) $= \{q_4, q_5\}$. Since the
right hand side of the rule $A \to aa$ is longer than the current stack in $c$, we
conclude that $next(c)$ results from a long reduction. By Definition 6 in [2], we
have $next(c) = \{([q_4], b, 1), ([q_5], b, 1)\}$. There again is only one path from $q_0$ to
$q_4$, namely $q_0 q_4$, and only one path from $q_0$ to $q_5$, namely $q_0 q_1 q_5$. Moreover, the
corresponding configurations may only result from the following parsing steps:

$$(q_0, aab) \vdash (q_0 q_1, ab) \vdash (q_0 q_1 q_2, b) \vdash (q_0 q_4, b)$$

$$(q_0, aaab) \vdash (q_0 q_1, aab) \vdash (q_0 q_1 q_2, ab) \vdash (q_0 q_1 q_2 q_3, b) \vdash (q_0 q_1 q_5, b).$$

Thus, we have $M(next(c)) = \{([q_0, q_4], aab, 3), ([q_0, q_1, q_5], aaab, 4)\}$. The
Simulation Lemma (Lemma 6) claims that $M(next(C)) = next(M(C))$, where $C$
denotes any set of stack configurations. Therefore, with $C := \{c\}$, this lemma is
obviously wrong. But then the complete proof of correctness is no longer valid,
either.

The complexity analysis given in [2] is correct for grammars without $\varepsilon$-rules,
i.e., rules of the form $A \to \varepsilon$. But Section 4.2, where the analysis is extended
to grammars which include such rules, contains a severe error. Lemma 15 states
that in every sentence of any LR grammar $G$, the number of hidden epsilons
between two nonepsilon terminal symbols is always bounded by some constant
that only depends on $G$. But it is possible to present a counterexample. Let $G$
be the following grammar:

$$S \to Ab \qquad A \to aAB \mid c \qquad B \to \varepsilon.$$

$G$ is LR(0), and it is easy to see that $L(G) = \{a^k c\, \varepsilon^k b \mid k \in \mathbb{N}_0\}$. Clearly, this
contradicts Lemma 15. But then the rest of the complexity analysis is also
incorrect.
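
The counterexample can be checked by expanding the grammar directly. The sketch below is illustrative only; the function names are hypothetical, and the marker "ε" is used to make each application of $B \to \varepsilon$ visible in the yield. It shows that the number of hidden epsilons between the terminals $c$ and $b$ equals $k$ and is therefore not bounded by a constant depending only on $G$.

```python
def expand_A(k):
    """Yield of A after applying A -> aAB exactly k times, then A -> c.
    Each B -> ε is recorded as the marker "ε" (illustration only)."""
    if k == 0:
        return ["c"]
    return ["a"] + expand_A(k - 1) + ["ε"]


def sentence_with_hidden_epsilons(k):
    """Yield of S -> Ab with the hidden epsilons kept visible."""
    return expand_A(k) + ["b"]


def hidden_epsilons_between_c_and_b(k):
    marked = sentence_with_hidden_epsilons(k)
    return marked[marked.index("c") + 1:marked.index("b")].count("ε")


for k in range(4):
    marked = sentence_with_hidden_epsilons(k)
    terminals = "".join(s for s in marked if s != "ε")
    print(k, terminals, hidden_epsilons_between_c_and_b(k))
```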
Abstract. A cover-automaton $A$ of a finite language $L \subseteq \Sigma^*$ is a finite
automaton that accepts all words in $L$ and possibly other words that are
longer than any word in $L$. A minimal deterministic cover automaton
of a finite language $L$ usually has a smaller size than a minimal DFA
that accepts $L$. Thus, cover automata can be used to reduce the size
of the representations of finite languages in practice. In this paper, we
describe an efficient algorithm that, for a given DFA accepting a finite
language, constructs a minimal deterministic finite cover-automaton of
the language. We also give algorithms for the boolean operations on
deterministic cover automata, i.e., on the finite languages they represent.



1 Introduction 

Regular languages and finite automata are widely used in many areas such as
lexical analysis, string matching, circuit testing, image compression, and parallel
processing. However, many applications of regular languages actually use only
finite languages. The number of states of a finite automaton that accepts a finite
language is at least one more than the length of the longest word in the language,
and can even be exponential in that number. If we do not restrict
an automaton to accept the exact given finite language but allow it to accept
extra words that are longer than the longest word in the language, we may obtain
an automaton such that the number of states is significantly reduced. In most
applications, we know the maximum length of the words in the language,
and the systems usually keep track of the length of an input word anyway. So,
for a finite language, we can use such an automaton plus an integer to check
membership in the language. This is the basic idea behind cover automata for
finite languages.

Informally, a cover-automaton $A$ of a finite language $L \subseteq \Sigma^*$ is a finite
automaton that accepts all words in $L$ and possibly other words that are longer
than any word in $L$. In many cases, a minimal deterministic cover automaton
of a finite language $L$ has a much smaller size than a minimal DFA that accepts
of Canada grants OGP0041630.



J.-M. Champarnaud, D. Maurel, D. Ziadi (Eds.): WIA’98, LNCS 1660, pp. 43-^21 1999- 
(c) Springer- Verlag Berlin Heidelberg 1999 



44 Cezar Campeanu, Nicolae Santean, and Sheng Yu 

L. Thus, cover automata can be used to reduce the size of automata for finite 
languages in practice. 

Intuitively, a finite automaton that accepts a finite language (exactly) can 
be viewed as having structures for the following two functionalities: 

1. checking the patterns of the words in the language, and 

2. controlling the lengths of the words. 

In a high-level programming language environment, the length-control function 
is much easier to implement by counting with an integer than by using the 
structures of an automaton. Furthermore, the system usually does the length- 
counting anyway. Therefore, a DFA accepting a finite language may leave out 
the structures for the length-control function and, thus, reduce its complexity. 

The concept of cover automata is not totally new. Similar concepts have 
been studied in different contexts and for different purposes. See, for example, 
pzmn] . Most of previous work has been in the study of a descriptive complexity 
measure of arbitrary languages, which is called “automaticity” by Shallit et al. 
m- In our study, we consider cover automata as an implementing method that 
may reduce the size of the automata that represent finite languages. 

In this paper, as our main result, we give an efficient algorithm that, for 
a given finite language (given as a deterministic finite automaton or a cover 
automaton), constructs a minimal cover automaton for the language. Note that 
for a given finite language, there might be several minimal cover automata that 
are not equivalent under a morphism. We will show that, however, they all have 
the same number of states. 

2 Preliminaries 

Let $T$ be a set. Then by $\#T$ we mean the cardinality of $T$. The elements of $T^*$
are called strings or words. The empty string is denoted by $\lambda$. If $w \in T^*$ then
$|w|$ is the length of $w$.

We define $T^l = \{w \in T^* \mid |w| = l\}$, $T^{\le l} = \bigcup_{i=0}^{l} T^i$, and $T^{<l} = \bigcup_{i=0}^{l-1} T^i$. We say
that $x$ is a prefix of $y$, denoted $x \preceq_p y$, if $y = xz$ for some $z \in T^*$. The relation
$\preceq_p$ is a partial order on $T^*$. If $T = \{t_1, \ldots, t_k\}$ is an ordered set, $k > 0$, the
quasi-lexicographical order on $T^*$, denoted $\prec$, is defined by: $x \prec y$ iff $|x| < |y|$
or $|x| = |y|$ and $x = z t_i v$, $y = z t_j u$, $i < j$, for some $z, u, v \in T^*$ and $1 \le i, j \le k$.
Denote $x \preceq y$ if $x \prec y$ or $x = y$.

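A deterministic finite automaton (DFA) is a quintuple $A = (\Sigma, Q, q_0, \delta, F)$,
where $\Sigma$ and $Q$ are finite nonempty sets, $q_0 \in Q$, $F \subseteq Q$ and $\delta : Q \times \Sigma \to Q$
is the transition function. We can extend $\delta$ from $Q \times \Sigma$ to $Q \times \Sigma^*$ by

$$\bar{\delta}(s, \lambda) = s,$$

$$\bar{\delta}(s, aw) = \bar{\delta}(\delta(s, a), w).$$

We usually denote $\bar{\delta}$ by $\delta$.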
The language recognised by the automaton $A$ is $L(A) = \{w \in \Sigma^* \mid \delta(q_0, w) \in F\}$.
For simplicity, we assume that $Q = \{0, 1, \ldots, \#Q - 1\}$ and $q_0 = 0$. In what
follows we assume that $\delta$ is a total function, i.e., the automaton is complete.

Let $l$ be the length of the longest word(s) in the finite language $L$. A DFA
$A$ such that $L(A) \cap \Sigma^{\le l} = L$ is called a deterministic finite cover-automaton
(DFCA) of $L$. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA of a finite language $L$. We say
that $A$ is a minimal DFCA of $L$ if for every DFCA $B = (Q', \Sigma, \delta', 0, F')$ of $L$
we have $\#Q \le \#Q'$.
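
A small sketch may help to see how a DFCA is used together with the integer $l$. The example language (all nonempty words of length at most 2 over $\{a, b\}$), the two-state automaton, and the function names are assumptions chosen for illustration. The automaton alone accepts every nonempty word; membership in $L$ is obtained by adding the length test $|w| \le l$.

```python
from itertools import product


def run(delta, start, word):
    """Extended transition function: follow `word` from state `start`."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state


def dfca_accepts(delta, start, finals, max_len, word):
    """Membership in L = L(A) ∩ Σ^{≤ l}: length check plus a DFA run."""
    return len(word) <= max_len and run(delta, start, word) in finals


# Hypothetical DFCA for L = {a, b, aa, ab, ba, bb}, so l = 2.
DELTA = {(0, "a"): 1, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1}
FINALS = {1}
MAX_LEN = 2

all_words = ["".join(t) for n in range(4) for t in product("ab", repeat=n)]
covered = {w for w in all_words if dfca_accepts(DELTA, 0, FINALS, MAX_LEN, w)}
print(sorted(covered))
```

An exact DFA for the same language needs extra states that count the length up to 2 and a sink state, whereas the cover automaton gets by with two states.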

Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFA. Then

a) $q \in Q$ is said to be accessible if there exists $w \in \Sigma^*$ such that $\delta(0, w) = q$,

b) $q$ is said to be useful (coaccessible) if there exists $w \in \Sigma^*$ such that
$\delta(q, w) \in F$.

It is clear that for every DFA $A$ there exists an automaton $A'$ such that $L(A') =
L(A)$ and all the states of $A'$ are accessible and at most one of the states is not
useful (the sink state). The DFA $A'$ is called a reduced DFA.

In what follows we shall use only reduced DFAs.

3 Similarity Sequences and Similarity Sets 

In this section, we describe the $L$-similarity relation on $\Sigma^*$, which is a
generalisation of the equivalence relation $\equiv_L$ ($x \equiv_L y$: $xz \in L$ iff $yz \in L$ for all $z \in \Sigma^*$).
The notion of $L$-similarity was introduced in [7] and studied in 0], etc. In this
paper, $L$-similarity is used to establish our algorithms.

Let $\Sigma$ be an alphabet, $L \subseteq \Sigma^*$ a finite language, and $l$ the length of the
longest word(s) in $L$. Let $x, y \in \Sigma^*$. We define the following relations:

(1) $x \sim_L y$ if for all $z \in \Sigma^*$ such that $|xz| \le l$ and $|yz| \le l$, $xz \in L$ iff $yz \in L$;

(2) $x \not\sim_L y$ if $x \sim_L y$ does not hold.

The relation $\sim_L$ is called similarity relation with respect to $L$.

Note that the relation $\sim_L$ is reflexive, symmetric, but not transitive. For
example, let $\Sigma = \{a, b\}$ and $L = \{aab, baa, aabb\}$. It is clear that $aab \sim_L aabb$
and $baa \sim_L aabb$, but $aab \not\sim_L baa$.
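
The non-transitivity example can be verified by brute force. The following sketch is a direct, unoptimised reading of definition (1); the function name `similar` and the string representation of words are choices made here, not part of the paper. It enumerates every $z$ short enough that both $xz$ and $yz$ have length at most $l$.

```python
from itertools import product


def similar(x, y, language, alphabet):
    """x ~_L y: xz in L iff yz in L for all z with |xz| <= l and |yz| <= l.
    Brute-force check; intended for small examples only."""
    longest = max(len(w) for w in language)
    budget = longest - max(len(x), len(y))  # maximal admissible |z|
    for n in range(budget + 1):             # empty range if budget < 0
        for letters in product(alphabet, repeat=n):
            z = "".join(letters)
            if ((x + z) in language) != ((y + z) in language):
                return False
    return True


L = {"aab", "baa", "aabb"}
print(similar("aab", "aabb", L, "ab"))
print(similar("baa", "aabb", L, "ab"))
print(similar("aab", "baa", L, "ab"))
```

The third call fails on $z = b$: $aabb \in L$ but $baab \notin L$.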

The following lemma is obvious: 

Lemma 1 Let $L \subseteq \Sigma^*$ be a finite language and $x, y, z \in \Sigma^*$, $|x| \le |y| \le |z|$.
The following statements hold:

1. If $x \sim_L y$, $x \sim_L z$, then $y \sim_L z$.

2. If $x \sim_L y$, $y \sim_L z$, then $x \sim_L z$.

3. If $x \sim_L y$, $y \not\sim_L z$, then $x \not\sim_L z$.

If $x \not\sim_L y$ and $y \sim_L z$, we cannot say anything about the similarity relation
between $x$ and $z$.

Example 1. Let $x, y, z \in \Sigma^*$, $|x| \le |y| \le |z|$. We may have

1) $x \not\sim_L y$, $y \sim_L z$ and $x \sim_L z$, or

2) $x \not\sim_L y$, $y \sim_L z$ and $x \not\sim_L z$.

Indeed, if $L = \{aa, aaa, bbb, bbbb, aaab\}$ we have 1) if we choose $x = aa$, $y = bbb$,
$z = bbbb$, and 2) if we choose $x = aa$, $y = bba$, $z = abba$.

Definition 1. Let $L \subseteq \Sigma^*$ be a finite language.

1. A set $S \subseteq \Sigma^*$ is called an $L$-similarity set if $x \sim_L y$ for every pair $x, y \in S$.

2. A sequence of words $[x_1, \ldots, x_n]$ over $\Sigma$ is called a dissimilar sequence of $L$
if $x_i \not\sim_L x_j$ for each pair $i, j$, $1 \le i, j \le n$ and $i \neq j$.

3. A dissimilar sequence $[x_1, \ldots, x_n]$ is called a canonical dissimilar sequence
of $L$ if there exists a partition $\pi = \{S_1, \ldots, S_n\}$ of $\Sigma^*$ such that for each $i$,
$1 \le i \le n$, $x_i \in S_i$, and $S_i$ is an $L$-similarity set.

4. A dissimilar sequence $[x_1, \ldots, x_n]$ of $L$ is called a maximal dissimilar
sequence of $L$ if for any dissimilar sequence $[y_1, \ldots, y_m]$ of $L$, $m \le n$.



Theorem 1. A dissimilar sequence of L is a canonical dissimilar sequence of L 
if and only if it is a maximal dissimilar sequence of L. 

Proof. Let $L$ be a finite language. Let $[x_1, \ldots, x_n]$ be a canonical dissimilar
sequence of $L$ and $\pi = \{S_1, \ldots, S_n\}$ the corresponding partition of $\Sigma^*$ such that
for each $i$, $1 \le i \le n$, $S_i$ is an $L$-similarity set. Let $[y_1, \ldots, y_m]$ be an arbitrary
dissimilar sequence of $L$. Assume that $m > n$. Then there are $y_i$ and $y_j$, $i \neq j$,
such that $y_i, y_j \in S_k$ for some $k$, $1 \le k \le n$. Since $S_k$ is an $L$-similarity set,
$y_i \sim_L y_j$. This is a contradiction. Then, the assumption that $m > n$ is false, and
we conclude that $[x_1, \ldots, x_n]$ is a maximal dissimilar sequence.

Conversely, let $[x_1, \ldots, x_n]$ be a maximal dissimilar sequence of $L$. Without loss
of generality we can suppose that $|x_1| \le \ldots \le |x_n|$. For $i = 1, \ldots, n$, define

$$X_i = \{y \in \Sigma^* \mid y \sim_L x_i \text{ and } y \not\sim_L x_j \text{ for } j < i\}.$$

Note that for each $y \in \Sigma^*$, $y \sim_L x_i$ for at least one $i$, $1 \le i \le n$, since $[x_1, \ldots, x_n]$
is a maximal dissimilar sequence. Thus, $\pi = \{X_1, \ldots, X_n\}$ is a partition of $\Sigma^*$.
The remaining task of the proof is to show that each $X_i$, $1 \le i \le n$, is a similarity
set.

We assume the contrary, i.e., for some $i$, $1 \le i \le n$, there exist $y, z \in X_i$ such
that $y \not\sim_L z$. We know that $x_i \sim_L y$ and $x_i \sim_L z$ by the definition of $X_i$. We have
the following three cases: (1) $|x_i| \le |y|, |z|$, (2) $|y| \le |x_i| \le |z|$ (or $|z| \le |x_i| \le |y|$),
and (3) $|x_i| > |y|, |z|$. If (1) or (2), then $y \sim_L z$ by Lemma 1. This would
contradict our assumption. If (3), then it is easy to prove that $y \not\sim_L x_j$ and $z \not\sim_L x_j$, for all
$j \neq i$, using Lemma 1 and the definition of $X_i$. Then we can replace $x_i$ by both
$y$ and $z$ to obtain a longer dissimilar sequence $[x_1, \ldots, x_{i-1}, y, z, x_{i+1}, \ldots, x_n]$.
This contradicts the fact that $[x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n]$ is a maximal
dissimilar sequence of $L$. Hence, $y \sim_L z$ and $X_i$ is a similarity set.

Corollary 1. For each finite language L, there is a unique number N(L) which 
is the number of elements in any canonical dissimilar sequence of L. 
Theorem 2. Let $S_1$ and $S_2$ be two $L$-similarity sets and $x_1$ and $x_2$ the shortest
words in $S_1$ and $S_2$, respectively. If $x_1 \sim_L x_2$ then $S_1 \cup S_2$ is an $L$-similarity set.

Proof. It suffices to prove that for an arbitrary word $y_1 \in S_1$ and an arbitrary
word $y_2 \in S_2$, $y_1 \sim_L y_2$ holds. Without loss of generality, we assume that
$|x_1| \le |x_2|$. We know that $|x_1| \le |y_1|$ and $|x_2| \le |y_2|$. Since $x_1 \sim_L x_2$ and
$x_2 \sim_L y_2$, we have $x_1 \sim_L y_2$ (Lemma 1(2)), and since $x_1 \sim_L y_1$ and $x_1 \sim_L y_2$,
we have $y_1 \sim_L y_2$ (Lemma 1(1)).

4 Similarity Relations on States 

Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFA and $L = L(A)$. Then it is clear that if $\delta(0, x) =
\delta(0, y) = q$ for some $q \in Q$, then $x \equiv_L y$ and, thus, $x \sim_L y$. Therefore, we can
also define equivalence as well as similarity relations on states.

Definition 2. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFA. We define, for each state $q \in Q$,

$$level(q) = \min\{|w| \mid \delta(0, w) = q\},$$

i.e., $level(q)$ is the length of the shortest path from the initial state to $q$.
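
Since $level(q)$ is a shortest-path length, it can be computed by a breadth-first search from the initial state. The sketch below assumes a complete DFA given as a dictionary from (state, symbol) pairs to states; that representation and the function name are choices made for this example. Inaccessible states simply do not appear in the result.

```python
from collections import deque


def levels(delta, alphabet, start=0):
    """level(q) for every accessible state q, by breadth-first search."""
    level = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for symbol in alphabet:
            nxt = delta[(state, symbol)]
            if nxt not in level:
                # First visit in BFS order gives the shortest path length.
                level[nxt] = level[state] + 1
                queue.append(nxt)
    return level


# Hypothetical complete DFA for {a, aa} over {a}; state 3 is the sink.
CHAIN = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3}
print(levels(CHAIN, "a"))
```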

Definition 3. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFA and $L = L(A)$. We say that
$p \equiv_A q$ (state $p$ is equivalent to $q$ in $A$) if for every $w \in \Sigma^*$, $\delta(p, w) \in F$ iff
$\delta(q, w) \in F$.

Definition 4. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA of a finite language $L$. Let
$level(p) = i$ and $level(q) = j$, $m = \max\{i, j\}$. We say that $p \sim_A q$ (state $p$ is
$L$-similar to $q$ in $A$) if for every $w \in \Sigma^{\le l-m}$, $\delta(p, w) \in F$ iff $\delta(q, w) \in F$.

If $A = (Q, \Sigma, \delta, 0, F)$ is a DFA, for each $q \in Q$, we denote $x_A(q) = \min\{w \mid
\delta(0, w) = q\}$, where the minimum is taken according to the quasi-lexicographical
order, and $L_A(q) = \{w \in \Sigma^* \mid \delta(q, w) \in F\}$. When the automaton $A$ is
understood, we write $x_q$ instead of $x_A(q)$ and $L_q$ instead of $L_A(q)$.

Lemma 2 Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA of a finite language $L$. Let $x, y \in
\Sigma^*$ such that $\delta(0, x) = p$ and $\delta(0, y) = q$. If $p \sim_A q$ then $x \sim_L y$.

Proof. Let $level(p) = i$ and $level(q) = j$, $m = \max\{i, j\}$, and $p \sim_A q$. Choose an
arbitrary $w \in \Sigma^*$ such that $|xw| \le l$ and $|yw| \le l$. Because $i \le |x|$ and $j \le |y|$ it
follows that $|w| \le l - m$. Since $p \sim_A q$ we have that $\delta(p, w) \in F$ iff $\delta(q, w) \in F$,
i.e. $\delta(0, xw) \in F$ iff $\delta(0, yw) \in F$, which means that $xw \in L(A)$ iff $yw \in L(A)$.
Hence $x \sim_L y$.

Lemma 3 Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA of a finite language $L$. Let $level(p) =
i$ and $level(q) = j$, $m = \max\{i, j\}$, and $x \in \Sigma^i$, $y \in \Sigma^j$ such that $\delta(0, x) = p$
and $\delta(0, y) = q$. If $x \sim_L y$ then $p \sim_A q$.
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Proof. Let x y and w £ If 6{p,w) £ F, then 6{0,xw) £ F. Because 

a; y, it follows that i5(0, yw) £ F, so 5{q, w) £ F. Using the symmetry we get 
that p^Aq- I 

Corollary 2. Let A = {Q, S,S,0, F) be a DFCA of a finite language L. Let 
level{p) = i and level{q) = j, m = max{i, j}, and x\ £ U®, yi £ , X 2 ,y 2 £ 

E* , such that <5(0, a;i) = <5(0, a; 2 ) = p and <5(0, yi) = <5(0, 1 / 2 ) = 9- If xi yi 
then X 2 y 2 - 

Example 2. If $x_1$ and $y_1$ are not minimal, i.e. $|x_1| > i$ but $p = \delta(0, x_1)$, or
$|y_1| > j$ but $q = \delta(0, y_1)$, then the conclusion of Corollary 2 is not true.

Let $L = \{a, b, aa, aaa, bab\}$, so $l = 3$ (Figure 1).

We have that $b \sim_L bab$, but $b \not\sim_L a$.

Corollary 3. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA of a finite language $L$ and
$p, q \in Q$, $p \ne q$. Then $x_p \sim_L x_q$ iff $p \sim_A q$.

If $p \sim_A q$, $level(p) \le level(q)$ and $q \in F$, then $p \in F$.

Lemma 4 Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA of a finite language $L$. Let $s, p, q \in
Q$ such that $level(s) = i$, $level(p) = j$, $level(q) = k$, $i \le j \le k$. The following
statements are true:

1. If $s \sim_A p$, $s \sim_A q$, then $p \sim_A q$.

2. If $s \sim_A p$, $p \sim_A q$, then $s \sim_A q$.

3. If $s \sim_A p$, $p \not\sim_A q$, then $s \not\sim_A q$.

Proof. We apply Lemma ^ and Corollary 0 

Lemma 5 Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA of a finite language $L$. Let
$level(p) = i$, $level(q) = j$, and $m = \max\{i,j\}$. If $p \sim_A q$ then $L_p \cap \Sigma^{\le l-m} =
L_q \cap \Sigma^{\le l-m}$ and $L_p \cup L_q$ is an L-similarity set.

The proof is left to the reader. The next lemma is obvious.

Lemma 6 Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA of a finite language $L$. Let $i =
level(p)$ and $j = level(q)$, $i \le j$. Let $p \sim_A q$. Let $w = w_1 \ldots w_n \in \Sigma^{\le l}$ and
$p_i = \delta(0, w_1 \ldots w_i)$, $1 \le i \le n$. Then $w \in L$ iff $x_{p_k} w_{k+1} \ldots w_n \in L$ for $1 \le k \le n$.



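Lemma 7 Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA of a finite language $L$. If $p \sim_A q$
for some $p, q \in Q$ with $i = level(p)$, $j = level(q)$, $i \le j$, $p \ne q$ and $q \ne 0$, then
we can construct a DFCA $A' = (Q', \Sigma, \delta', 0, F')$ of $L$ such that $Q' = Q - \{q\}$,
$F' = F - \{q\}$, and

$\delta'(s,a) = \begin{cases} \delta(s,a) & \text{if } \delta(s,a) \ne q, \\ p & \text{if } \delta(s,a) = q \end{cases}$

for each $s \in Q'$ and $a \in \Sigma$. Thus, $A$ is not a minimal DFCA of $L$.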

Proof. It suffices to prove that $A'$ is a DFCA of $L$. Let $l$ be the length of the
longest word(s) in $L$ and assume that $level(p) = i$ and $level(q) = j$, $i \le j$.
Consider a word $w \in \Sigma^{\le l}$. We now prove that $w \in L$ iff $\delta'(0, w) \in F'$.

If there is no prefix $w_1$ of $w$ such that $\delta(0, w_1) = q$, then clearly $\delta'(0, w) \in F'$
iff $\delta(0, w) \in F$. Otherwise, let $w = w_1 w_2$ where $w_1$ is the shortest prefix of $w$ such
that $\delta(0, w_1) = q$. It then suffices to prove that $\delta'(p, w_2) \in F'$ iff
$\delta(q, w_2) \in F$. We prove this by induction on the length of $w_2$. First consider the
case $|w_2| = 0$, i.e., $w_2 = \lambda$. Since $p \sim_A q$, $p \in F$ iff $q \in F$. Then $p \in F'$ iff $q \in F$
by the construction of $A'$. Thus, $\delta'(p, w_2) \in F'$ iff $\delta(q, w_2) \in F$. Suppose that
the statement holds for $|w_2| < l'$ for some $l' \le l - |w_1|$. (Note that $l - |w_1| \le l - j$.)
Consider the case $|w_2| = l'$. If there does not exist a nonempty prefix $u$ of $w_2$
such that $\delta(p, u) = q$, then $\delta(p, w_2) \in F - \{q\}$ iff $\delta(q, w_2) \in F - \{q\}$, i.e.,
$\delta'(p, w_2) \in F'$ iff $\delta(q, w_2) \in F$. Otherwise, let $w_2 = uv$ where $u$ is the shortest
nonempty prefix of $w_2$ such that $\delta(p, u) = q$. Then $|v| < l'$ (and $\delta'(p, u) = p$).
By the induction hypothesis, $\delta'(p, v) \in F'$ iff $\delta(q, v) \in F$. Therefore, $\delta'(p, uv) \in F'$
iff $\delta(q, uv) \in F$.
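The construction of Lemma 7 is a purely local rewrite of the transition table. The sketch below illustrates it under the assumption that the DFCA is stored as `delta[state][symbol]` and that the caller has already established $p \sim_A q$ with $level(p) \le level(q)$; the function name `merge_similar` is hypothetical.

```python
def merge_similar(delta, final, p, q):
    """Build (delta', F') of Lemma 7: drop state q and redirect every
    transition that entered q to p. Assumes p ~_A q, p != q, q != 0."""
    new_delta = {
        s: {a: (p if t == q else t) for a, t in row.items()}
        for s, row in delta.items()
        if s != q
    }
    return new_delta, set(final) - {q}

# Assumed example: DFA for L = {a, aa} (l = 2); states 1 and 2 are similar.
delta = {0: {'a': 1}, 1: {'a': 2}, 2: {'a': 3}, 3: {'a': 3}}
new_delta, new_final = merge_similar(delta, {1, 2}, p=1, q=2)
print(new_delta, new_final)
```

In this example the result accepts $a^+$, which agrees with $L$ on all words of length at most 2; the old sink state simply becomes unreachable.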



Lemma 8 Let $A$ be a DFCA of $L$ and $L' = L(A)$. Then $x \equiv_{L'} y$ implies $x \sim_L y$.

Proof. It is clear that if $x \equiv_L y$ then $x \sim_L y$. Let $l$ be the length of the longest
word(s) in $L$. Let $x \equiv_{L'} y$. So, for each $z \in \Sigma^*$, $xz \in L'$ iff $yz \in L'$. We now
consider all words $z \in \Sigma^*$ such that $|xz| \le l$ and $|yz| \le l$. Since $L = L' \cap \Sigma^{\le l}$
and $xz \in L'$ iff $yz \in L'$, we have $xz \in L$ iff $yz \in L$. Therefore, $x \sim_L y$ by the
definition of $\sim_L$.

Corollary 4. Let $A = (Q, \Sigma, \delta, 0, F)$ be a DFCA of a finite language $L$, $L' =
L(A)$. Then $p \equiv_A q$ implies $p \sim_A q$.



Corollary 5. A minimal DFCA of L is a minimal DFA. 
Proof. Let $A = (Q, \Sigma, \delta, 0, F)$ be a minimal DFCA of a finite language $L$. Suppose
that $A$ is not minimal as a DFA for $L' = L(A)$. Then there exist $p, q \in Q$, $p \ne q$, such
that $p \equiv_{L'} q$, and hence $p \sim_A q$. By Lemma 7 it follows that $A$ is not a minimal
DFCA, a contradiction.

Remark 1. Let A be a DFCA of L and A a minimal DFA. Then A may not be 
a minimal DFCA of L. 



Example 3. We take the DFAs of Figure 2.

Fig. 2. Example

The DFA in Automaton 1 is a minimal DFA and a DFCA of $L = \{\lambda, a, aa\}$
but not a minimal DFCA of $L$, since the DFA in Automaton 2 is a minimal
DFCA of $L$.



Theorem 3. Any minimal DFCA of L has exactly N{L) states. 

Proof. Let $A = (Q, \Sigma, \delta, 0, F)$ be a minimal DFCA of a finite language $L$, and $\#Q = n$.

Suppose that $n > N(L)$. Then there exist $p, q \in Q$, $p \ne q$, such that $x_p \sim_L x_q$
(because of the definition of $N(L)$). Then $p \sim_A q$ by Lemma 3. Thus, $A$ is not
minimal. A contradiction.

Suppose that $N(L) > n$. Let $[y_1, \ldots, y_{N(L)}]$ be a canonical dissimilar
sequence of $L$. Then there exist $i, j$, $1 \le i, j \le N(L)$ and $i \ne j$, such that
$\delta(0, y_i) = \delta(0, y_j) = q$ for some $q \in Q$. Then $y_i \sim_L y_j$. Again a contradiction.
Therefore, we have $n = N(L)$.



5 The Construction of Minimal DFCA 

The first part of this section describes an algorithm that determines the
similarity relation between states. The second part constructs a minimal DFCA,
assuming that the similarity relation between states is known.

An ordered DFA is a DFA where $\delta(i,a) = j$ implies that $i \le j$, for all states
$i, j$ and letters $a$.

5.1 Determining Similarity Relation between States

The aim is to present an algorithm which determines the similarity relations 
between states. 

Let $A = (\Sigma, Q, 0, \delta, F)$ be a DFCA of a finite language $L$. For each $s \in Q$ let
$\gamma_s = \min\{w \mid \delta(s,w) \in F\}$, where the minimum is taken according to the
quasi-lexicographical order. Define $D_i = \{s \in Q \mid |\gamma_s| = i\}$, for each $i = 0, 1, \ldots$.
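Only the lengths $|\gamma_s|$ are needed to form the classes $D_i$, and these are shortest-path distances to the set of final states. A possible sketch, assuming the same dictionary representation `delta[state][symbol]` (the names `gamma_lengths` and `d_classes` are hypothetical), uses a backward breadth-first search:

```python
from collections import deque

def gamma_lengths(delta, alphabet, final):
    """Map each state s that can reach a final state to |gamma_s|.
    States from which no final state is reachable are omitted."""
    preds = {}
    for s, row in delta.items():
        for a in alphabet:
            preds.setdefault(row[a], set()).add(s)
    dist = {f: 0 for f in final}
    queue = deque(final)
    while queue:
        t = queue.popleft()
        for s in preds.get(t, ()):
            if s not in dist:
                dist[s] = dist[t] + 1
                queue.append(s)
    return dist

def d_classes(delta, alphabet, final):
    """Group states into the classes D_i = {s : |gamma_s| = i}."""
    classes = {}
    for s, d in gamma_lengths(delta, alphabet, final).items():
        classes.setdefault(d, set()).add(s)
    return classes

# Assumed example: DFA for L = {a, aa} with sink state 3.
delta = {0: {'a': 1}, 1: {'a': 2}, 2: {'a': 3}, 3: {'a': 3}}
print(d_classes(delta, 'a', {1, 2}))
```

By Lemma 9 below, states that land in different classes can never be similar, so the classes act as a cheap pre-filter.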

Lemma 9 Let $A = (\Sigma, Q, 0, \delta, F)$ be a DFCA of a finite language $L$, and $s \in D_i$,
$p \in D_j$. If $i \ne j$ then $s \not\sim p$.

Proof. We can assume that $i < j$. Then obviously $\delta(s, \gamma_s) \in F$ and $\delta(p, \gamma_s) \notin F$.
Since $l \ge |x_s| + |\gamma_s|$, $l \ge |x_p| + |\gamma_p|$, and $i < j$, it follows that $|\gamma_s| < |\gamma_p|$. So, we
have that $|\gamma_s| \le \min(l - |x_s|, l - |x_p|)$. Hence, $s \not\sim p$.



Lemma 10 Let $A = (Q, \Sigma, 0, \delta, F)$ be a reduced ordered DFA accepting $L$, $p, q \in
Q - \{\#Q - 1\}$, where $\#Q - 1$ is the sink state, and either $p, q \in F$ or $p, q \notin F$.
If for all $a \in \Sigma$, $\delta(p, a) \sim_A \delta(q, a)$, then $p \sim_A q$.

Proof. Let $a \in \Sigma$ and $\delta(p,a) = r$ and $\delta(q,a) = s$. If $r \sim_A s$ then for all $w$,
$|w| \le l - \max\{|x_A(s)|, |x_A(r)|\}$, $x_A(r)w \in L$ iff $x_A(s)w \in L$. Using Lemma 6 we
also have: $x_A(q)aw \in L$ iff $x_A(s)w \in L$ for all $w \in \Sigma^*$, $|w| \le l - |x_A(s)|$, and
$x_A(p)aw \in L$ iff $x_A(r)w \in L$ for all $w \in \Sigma^*$, $|w| \le l - |x_A(r)|$.

Hence $x_A(p)aw \in L$ iff $x_A(q)aw \in L$, for all $w \in \Sigma^*$, $|w| \le l - \max\{|x_A(r)|,
|x_A(s)|\}$. Because $|x_A(r)| \le |x_A(p)a| = |x_A(p)| + 1$ and $|x_A(s)| \le |x_A(q)a| =
|x_A(q)| + 1$, we get $x_A(p)aw \in L$ iff $x_A(q)aw \in L$, for all $w \in \Sigma^*$, $|w| \le
l - \max\{|x_A(p)|, |x_A(q)|\} - 1$.

Since $a \in \Sigma$ is chosen arbitrarily, we conclude that $x_A(p)w \in L$ iff $x_A(q)w \in L$,
for all $w \in \Sigma^*$, $|w| \le l - \max\{|x_A(p)|, |x_A(q)|\}$, i.e. $x_A(p) \sim_L x_A(q)$. Therefore,
by using Lemma 3, we get that $p \sim_A q$.

Lemma 11 Let $A = (Q, \Sigma, 0, \delta, F)$ be a reduced ordered DFA accepting $L$ such
that $\delta(0, w) = s$ implies $|w| = |x_s|$ for all $s \in Q$. Let $p, q \in Q - \{\#Q - 1\}$, where
$\#Q - 1$ is the sink state. If there exists $a \in \Sigma$ such that $\delta(p,a) \not\sim_A \delta(q,a)$, then
$p \not\sim_A q$.

Proof. Suppose that $p \sim_A q$. Then for all $aw \in \Sigma^{\le l-m}$, $\delta(p, aw) \in F$ iff $\delta(q, aw) \in
F$, where $m = \max\{level(p), level(q)\}$. So $\delta(\delta(p,a),w) \in F$ iff $\delta(\delta(q,a),w) \in F$
for all $w \in \Sigma^{\le l-m-1}$. Since $|x_{\delta(p,a)}| = |x_p| + 1$ and $|x_{\delta(q,a)}| = |x_q| + 1$ it follows
by definition that $\delta(p,a) \sim_A \delta(q,a)$. This is a contradiction.

Our algorithm for determining the similarity relation between the states of
a DFA (DFCA) of a finite language is based on Lemmas 10 and 11. However,
most DFAs (DFCAs) do not satisfy the condition of Lemma 11. So, we shall
first transform the given DFA (DFCA) into one that does.

Let $A = (Q_A, \Sigma, 0, \delta_A, F_A)$ be a DFCA of $L$. We construct the minimal
DFA for the language $\Sigma^{\le l}$, $B = (Q_B, \Sigma, 0, \delta_B, F_B)$, with $Q_B = \{0, \ldots, l + 1\}$,
$\delta_B(i, a) = i + 1$ for all $i$, $0 \le i \le l$, $\delta_B(l + 1, a) = l + 1$, for all $a \in \Sigma$, and
$F_B = \{0, \ldots, l\}$. The DFA $B$ has exactly $l + 2$ states.

Now we use the standard Cartesian product construction (see, e.g., 0, for
details) for the DFA $C = (Q_C, \Sigma, q_0, \delta_C, F_C)$ such that $L(C) = L(A) \cap L(B)$,
and we eliminate all inaccessible states. Obviously, $L(C) = L$ and $C$ satisfies the
condition of Lemma 11.
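Because $B$ only counts the length of the input up to $l + 1$, the product $C$ can be built without materialising $B$. The following sketch explores only accessible pairs; it assumes $A$ is given as `delta_a[state][symbol]`, and the function name `product_with_length` is hypothetical.

```python
from collections import deque

def product_with_length(delta_a, final_a, alphabet, l, start=0):
    """Accessible part of C = A x B, where B counts word length up to l + 1.
    A state (p, k) is final iff p is final in A and k <= l."""
    init = (start, 0)
    delta_c, final_c = {}, set()
    seen = {init}
    queue = deque([init])
    while queue:
        p, k = queue.popleft()
        if p in final_a and k <= l:
            final_c.add((p, k))
        row = {}
        for a in alphabet:
            target = (delta_a[p][a], min(k + 1, l + 1))
            row[a] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
        delta_c[(p, k)] = row
    return delta_c, final_c, init

# Assumed example: one-state DFCA accepting a*, covering L = {lambda, a, aa}.
delta_c, final_c, init = product_with_length({0: {'a': 0}}, {0}, 'a', 2)
print(sorted(delta_c), sorted(final_c))
```

Every word reaching a state $(p, k)$ with $k \le l$ has length exactly $k$, which is the property used by Lemma 13.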

The next lemma is easy to prove and left for the reader. 

Lemma 12 For the DFA $C$ constructed above we have $(p, q) \sim_C (p, r)$.

Lemma 13 For the DFA $C$ constructed above, if $\delta_C((0,0), w) = (p, q)$, then
$|w| = q$.

Proof. We have $\delta_C((0,0), w) = (p, q)$, so $\delta_B(0, w) = q$, so $|w| = q$.

Now we are able to present an algorithm which determines the similarity
relation between the states of $C$. Note that $Q_C$ is ordered by $(p_A, p_B) <
(q_A, q_B)$ if $p_A < q_A$, or $p_A = q_A$ and $p_B < q_B$. Attached to each state of $C$ is a
list of similar states. For $\alpha, \beta \in Q_C$, if $\alpha \sim_C \beta$ and $\alpha < \beta$, then $\beta$ is stored on
the list of similar states for $\alpha$.

We assume that $Q_A = \{0, 1, \ldots, n\}$ and $n$ is the sink state of $A$.

1. Generate the DFA $B$ for the language $\Sigma^{\le l}$.

2. Compute the DFA $C$ such that $L(C) = L(A) \cap L(B)$ using the standard
Cartesian product algorithm (see 0 for details).

3. Compute $D_i$ of $C$, $0 \le i \le l$.

4. Initialize the similarity relation by specifying:

- For all $(n,p), (n,q) \in Q_C$, $(n,p) \sim_C (n,q)$.

- For all $(n, l + 1 - i) \in Q_C$, $(n, l + 1 - i) \sim_C \alpha$ for all $\alpha \in D_j$, $j = i, \ldots, l$,
$0 \le i \le l$.

5. For each $D_i$, $0 \le i \le l$, create a list $List_i$, which is initialized to $\emptyset$.

6. For each $\alpha \in Q_C - \{(n, q) \mid q \in Q_B\}$, following the reversed order of $Q_C$, do
the following, assuming $\alpha \in D_i$:

- For each $\beta \in List_i$, if $\delta_C(\alpha, a) \sim_C \delta_C(\beta, a)$ for all $a \in \Sigma$, then $\alpha \sim_C \beta$.
- Put $\alpha$ on the list $List_i$.



Remark 2. The above algorithm has complexity $O((n \times l)^2)$, where $n$ is the
number of states of the initial DFA (DFCA) and $l$ is the maximum accepted
length for the finite language $L$.

5.2 The Construction of a Minimal DFCA 

As input we have the above DFA $C$ and, with each $\alpha \in Q_C$, a set $S_\alpha = \{\beta \in
Q_C \mid \alpha \sim_C \beta \text{ and } \alpha < \beta\}$. The output is $D = (Q_D, \Sigma, \delta_D, q_0, F_D)$, a minimal
DFCA for $L$.

We define the following:

$i = 0$, $q_i = 0$, $T = Q_C - S_{q_i}$, ($x_0 = \lambda$);

while ($T \ne \emptyset$) do the following:
    $i = i + 1$;
    $q_i = \min\{s \in T\}$,
    $T = T - S_{q_i}$, ($x_i = \min\{w \mid \delta_C(0, w) \in S_{q_i}\}$);
$m = i$;

Then $Q_D = \{q_0, \ldots, q_m\}$; $q_0 = 0$; $\delta_D(i, a) = j$ iff $k = \min S_i$ and $\delta_C(k, a) \in
S_j$; $F_D = \{i \mid S_i \cap F_C \ne \emptyset\}$.

Note that the constructions of $x_i$ above are useful only for the proofs that
follow, where the min (minimum) operator for $x_i$ is taken according to the
lexicographical order. Let $X_i = \{(i, s) \mid (i, s) \in Q_C\}$ and $a_i = \#X_i$, $0 \le i \le l + 1$.

Step 1. For all $0 \le i \le l + 1$ do $b_i = a_i$; for all $(i,j) \in Q_C$ do $new((i,j)) = -1$.
Set $m = 0$, $r = 0$ and $s = 0$.

Step 2. Put $S_m = \{(p, q) \in Q_C \mid (r, s) \sim (p, q)\}$.

Step 3. For all $(p, q) \in S_m$, perform $new((p, q)) = m$ and $b_p = b_p - 1$.

Step 4. Put $m = m + 1$.

Step 5. While $b_r = 0$ and $r \le l + 1$ do $r = r + 1$. If $r > l + 1$ then go to Step 7,
else go to Step 6.

Step 6. Take the state $(r, s) \in X_r$ such that $new((r, s)) = -1$ and $s$ is
minimal with this property. Go to Step 2.

Step 7. $Q_D = \{0, \ldots, m - 1\}$, $F_D = \{i \mid new((p,q)) = i, (p,q) \in F_C\}$. For all
$q \in Q_D$ and $a \in \Sigma$ set $\delta_D(q, a) = -1$.

Step 8. For all $p = 0, \ldots, l + 1$, $q = 0, \ldots, n$, $(p, q) \in Q_C$ and $a \in \Sigma$, if
$\delta_D(new((p,q)), a) = -1$ define

$\delta_D(new((p, q)), a) = new(\delta_C((p, q), a))$.

According to the algorithm we have a total ordering of the states $Q_C$: $(p, q) \le
(r, s)$ if $(p, q) = (r, s)$, or $p < r$, or $p = r$ and $q < s$. Hence $\delta_D(i, a) = j$ iff
$\delta_D(0, x_i a) = j$. Also, using the construction (i.e. the total order on $Q_C$) it follows
that $0 = |x_0| \le |x_1| \le \cdots \le |x_{m-1}|$.

Lemma 14 The sequence $[x_0, x_1, \ldots, x_{m-1}]$ constructed above is a canonical
L-dissimilar sequence.

Proof. We construct the sets $X_i = \{w \in \Sigma^* \mid \delta(0,w) \in S_i\}$. Obviously $X_i \ne \emptyset$.
From Lemma 2 it follows that $X_i$ is an L-similarity set for all $0 \le i \le m - 1$.

Let $w \in \Sigma^*$. Because $(S_i)_{0 \le i \le m-1}$ is a partition of $Q$, $w \in X_i$ for some
$0 \le i \le m - 1$, so $(X_i)_{0 \le i \le m-1}$ is a partition of $\Sigma^*$ and therefore
$[x_0, \ldots, x_{m-1}]$ is a canonical L-dissimilar sequence.

Corollary 6. The automaton D constructed above is a minimal DFCA for L. 

Proof. Since the number of states is equal to the number of elements of a canonical
L-dissimilar sequence, we only have to prove that $D$ is a cover automaton
for $L$. Let $w \in \Sigma^{\le l}$. We have that $\delta_D(0,w) \in F_D$ iff $\delta_C((0, 0), w) \in S_f$ and
$S_f \cap F_C \ne \emptyset$, i.e. $x_f \sim_L w$. Since $|w| \le l$, $x_f \in L$ iff $w \in L$ (because $C$ is a
DFCA for $L$).

6 Boolean Operations

We shall use constructions similar to those in [3] for constructing DFCA of
languages which are the result of boolean operations between finite languages.
The modifications are suggested by the previous algorithm. We first construct
a DFCA which satisfies the hypothesis of Lemma 11, and afterwards we can
minimise it using the general algorithm. Since the minimisation follows in a
natural way, we present only the construction of the necessary DFCA.

Let $A_i = (Q_i, \Sigma, 0, \delta_i, F_i)$, $i = 1, 2$, be two DFCA of the finite languages $L_i$,
with $l_i = \max\{|w| \mid w \in L_i\}$, $i = 1, 2$.

6.1 Intersection 

We construct the following DFA: 

$A = (Q_1 \times Q_2 \times \{0, \ldots, l + 1\}, \Sigma, \delta, (0, 0, 0), F)$, where

$l = \min\{l_1, l_2\}$, $\delta((s,p,q), a) = (\delta_1(s,a), \delta_2(p,a), q + 1)$, for $s \in Q_1$, $p \in Q_2$,
$q \le l$, and $\delta((s,p, l + 1), a) = (\delta_1(s, a), \delta_2(p, a), l + 1)$ and $F = \{(s,p,q) \mid s \in
F_1, p \in F_2, q \le l\}$.

Theorem 4. The automaton $A$ constructed above is a DFA for $L = L(A_1) \cap L(A_2)$.

Proof. We have the following relations: $w \in L_1 \cap L_2$ iff $|w| \le l$ and $w \in L_1$
and $w \in L_2$ iff $|w| \le l$ and $w \in L(A_1)$ and $w \in L(A_2)$. The rest of the proof is
obvious.



6.2 Union 

We construct the following DFA: 

$A = (Q_1 \times Q_2 \times \{0, \ldots, l + 1\}, \Sigma, \delta, (0, 0, 0), F)$, where

$l = \max\{l_1, l_2\}$, $m = \min\{l_1, l_2\}$, $\delta((s,p,q), a) = (\delta_1(s,a), \delta_2(p,a), q + 1)$,
for $s \in Q_1$, $p \in Q_2$, $q \le l$, and $\delta((s,p, l + 1), a) = (\delta_1(s, a), \delta_2(p, a), l + 1)$ and
$F = \{(s,p,q) \mid s \in F_1 \text{ or } p \in F_2, q \le m\} \cup \{(s,p,q) \mid s \in F_r \text{ and } m < q \le l\}$,
where $r$ is such that $l_r = l$.

Theorem 5. The automaton $A$ constructed above is a DFA for $L = L(A_1) \cup L(A_2)$.

Proof. We have the following relations: $w \in L_1 \cup L_2$ iff $|w| \le m$ and $w \in L_1$ or
$w \in L_2$, or $m < |w| \le l$ and $w \in L_r$ iff $|w| \le m$ and $w \in L(A_1)$ or $w \in L(A_2)$,
or $m < |w| \le l$ and $w \in L(A_r)$. The rest of the proof is obvious.

6.3 Symmetric Difference 

We construct the following DFA: 

$A = (Q_1 \times Q_2 \times \{0, \ldots, l + 1\}, \Sigma, \delta, (0, 0, 0), F)$, where

$l = \max\{l_1, l_2\}$, $m = \min\{l_1, l_2\}$, $\delta((s,p,q), a) = (\delta_1(s,a), \delta_2(p,a), q + 1)$,
for $s \in Q_1$, $p \in Q_2$, $q \le l$, and $\delta((s,p, l + 1), a) = (\delta_1(s, a), \delta_2(p, a), l + 1)$ and
$F = \{(s,p,q) \mid s \in F_1 \text{ exclusive or } p \in F_2, q \le m\} \cup \{(s,p,q) \mid s \in F_r \text{ and } m <
q \le l\}$, where $r$ is such that $l_r = l$.
Theorem 6. The automaton $A$ constructed above is a DFA for $L = L(A_1) \Delta
L(A_2)$.

Proof. We have the following relations: $w \in L_1 \Delta L_2$ iff $|w| \le m$ and $w \in L_1$
exclusive or $w \in L_2$, or $m < |w| \le l$ and $w \in L_r$ iff $|w| \le m$ and $w \in L(A_1)$
exclusive or $w \in L(A_2)$, or $m < |w| \le l$ and $w \in L(A_r)$. The rest of the proof is
obvious.



6.4 Difference 

We construct the following DFA: 

$A = (Q_1 \times Q_2 \times \{0, \ldots, l + 1\}, \Sigma, \delta, (0, 0, 0), F)$, where

$l = \max\{l_1, l_2\}$, $m = \min\{l_1, l_2\}$ and $\delta((s,p,q), a) = (\delta_1(s, a), \delta_2(p, a), q + 1)$,
for $s \in Q_1$, $p \in Q_2$, $q \le l$, and $\delta((s,p, l + 1), a) = (\delta_1(s, a), \delta_2(p, a), l + 1)$. If
$l_1 < l_2$ then $F = \{(s,p,q) \mid s \in F_1 \text{ and } p \notin F_2, q \le m\}$, and if $l_1 \ge l_2$ then
$F = \{(s,p,q) \mid s \in F_1 \text{ and } p \notin F_2, q \le m\} \cup \{(s,p,q) \mid s \in F_1 \text{ and } m < q \le l\}$.

Theorem 7. The automaton $A$ constructed above is a DFA for $L = L(A_1) -
L(A_2)$.

Proof. We have the following relations: $w \in L_1 - L_2$ iff $|w| \le m$ and $w \in L_1$ and
$w \notin L_2$, or $m < |w| \le l$ and $w \in L_1$ iff $|w| \le m$ and $w \in L(A_1)$ and $w \notin L(A_2)$,
or $m < |w| \le l$ and $w \in L(A_1)$. The rest of the proof is obvious.
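The four acceptance conditions of Sections 6.1 to 6.4 differ only in how the two component states and the length counter are combined. The sketch below simulates the product automaton on a single word instead of materialising it; each DFCA is assumed to be a triple `(delta, final, l)` with `delta[state][symbol]`, and the names `run` and `combined_accepts` are hypothetical.

```python
def run(delta, word, start=0):
    """State reached from `start` after reading `word`."""
    state = start
    for a in word:
        state = delta[state][a]
    return state

def combined_accepts(word, dfca1, dfca2, op):
    """Decide membership of `word` in L1 op L2, following the definitions of F
    in Sections 6.1-6.4. The third component q of the product state is |word|."""
    (d1, f1, l1), (d2, f2, l2) = dfca1, dfca2
    in1, in2 = run(d1, word) in f1, run(d2, word) in f2
    q, m, l = len(word), min(l1, l2), max(l1, l2)
    if op == "intersection":
        return q <= m and in1 and in2
    if q <= m:
        return {"union": in1 or in2,
                "symdiff": in1 != in2,
                "difference": in1 and not in2}[op]
    if q <= l:  # here l1 != l2, so only the longer language can contain word
        if op == "difference":
            return l1 > l2 and in1
        return in1 if l1 > l2 else in2
    return False

# Assumed example: A1 accepts a+ and covers L1 = {a, aa};
# A2 accepts a* and covers L2 = {lambda, a, aa, aaa}.
A1 = ({0: {'a': 1}, 1: {'a': 1}}, {1}, 2)
A2 = ({0: {'a': 0}}, {0}, 3)
for op in ("intersection", "union", "symdiff", "difference"):
    print(op, [w for w in ("", "a", "aa", "aaa", "aaaa")
               if combined_accepts(w, A1, A2, op)])
```

Using `len(word)` in place of the capped counter gives the same acceptance decision, since every state with counter $l + 1$ is non-final.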

Open Problems 1) Try to find a better algorithm for minimisation, or prove
that any minimisation algorithm has complexity $\Omega(n^4)$. 2) Find a better
algorithm for determining similar states in any DFCA of $L$. 3) Find better
algorithms for boolean operations on DFCA.
Abstract. We establish a new upper bound on the number of states of
the automaton yielded by the determinization of a Glushkov automaton. 
We show that the ZPC structure, which is an implicit construction for 
Glushkov automata, leads to an efficient implementation of the subset 
construction. 



1 Introduction 

Automata determinization may be exponential, whereas most of automata op- 
erations are polynomial. It is not possible to avoid this behaviour in the general 
case nni. Therefore it is quite natural to take great care of the implementation 
of determinization algorithm, so that this transformation penalizes as little as 
possible the performances of an automata software. 

Two different kinds of computations are carried along the determinization 
process: the computation of the transition of a state of the deterministic au- 
tomaton by a letter (which yields a subset of states of the nondeterministic 
automaton), and the set equality tests which make it possible to decide whether 
a transition generates a new state or not. 

Concerning the computation of the transitions, the choice of the data struc- 
ture used to implement the transitions of the nondeterministic automaton has a 
lot of influence on the performances. The AUTOMATE g], INR j^j and Grail jT7] 
softwares make use of a representation in which both transitions with the same 
origin and transitions with the same origin and the same letter are contiguous. 
Jonhson and Wood 0 have studied the efficiency of sorting procedures applied 
to subsets computed from such data structures. 

As for the set equality tests, the choice of the data structure for memorizing
the subsets is obviously an important factor of the complexity. According to
Ponty m, the number of integer comparisons involved by set equality tests
is $O(\sqrt{n}\, 2^{2n})$ if arrays are used and $O(n^2 (\log n) 2^n)$ if binary search trees are
used. Such complexities justify the work of Leslie El and Leslie, Raymond and
Wood [T^.

* This work is a contribution to the Automate software development project car- 
ried on by A. I. A. Working Group (Algorithmics and Implementation of Automata), 
L.I.F.A.R. Contact: {Champarnaud, Ziadi}@dir.univ-rouen.fr.
In this paper we consider a particular family of automata, which are
computed from regular expressions following the Glushkov algorithm. Glushkov
automata have been characterized by Caron and Ziadi in terms of graphs. On
the other hand, Ziadi et al. [El, have designed a time and space linear
representation of the Glushkov automaton of a regular expression, which is based on two
forests of states and a set of links going from one forest to the other one. The
ZPC structure leads to an output sensitive implementation of the conversion of
a regular expression into an automaton uni and to an efficient algorithm for
testing the membership of a word to a regular language Q.

We shall show first that the number of states of the automaton resulting from
the determinization of a Glushkov automaton can be bounded more tightly than
in the general case. Then we shall make use of the properties of the ZPC structure
to improve the complexity of the computation of the transitions, as well as the
complexity of set equality tests.

The next section recalls some useful definitions and notations, as well as
the Glushkov algorithm and the ZPC structure design. Section 3 describes the
subset construction. Section 4 gathers our results about Glushkov automata
determinization.

2 Definitions and Notations 

We shall limit ourselves to definitions involved by the description of the new 
algorithms. For further details about regular languages and finite automata, the 
references PEECS] should be consulted. 

A finite automaton is a 5-tuple $\mathcal{M} = (Q, \Sigma, \delta, I, F)$ where $Q$ is a (finite) set
of states, $\Sigma$ is a (finite) alphabet, $I \subseteq Q$ is the set of initial states, $F \subseteq Q$ is
the set of terminal states, and $\delta$ is the transition relation. A deterministic finite
automaton (DFA) has a unique initial state and arrives in a unique state (if
any) after scanning a symbol of $\Sigma$. Otherwise the automaton is nondeterministic
(NFA). The language $L(\mathcal{M})$ recognized by the automaton $\mathcal{M}$ is the set of words
of $\Sigma^*$ whose scanning makes $\mathcal{M}$ arrive at a terminal state.

A regular expression over an alphabet $\Sigma$ is generated by recursively applying
operators '+' (union), '·' (concatenation) and '*' (Kleene star) to atomic
expressions (every symbol of $\Sigma$, the empty word and the empty set). A language is
regular if and only if it can be denoted by a regular expression. The length of
a regular expression $E$, denoted $|E|$, is the number of operators and symbols in
$E$. The alphabetic width of $E$, denoted $\|E\|$, is the number of symbols in $E$.

The Kleene theorem cni states that a language is regular if and only if it 
is recognized by a finite automaton. Computing the Glushkov automaton of a
regular expression 0 is a constructive proof of this theorem.

2.1 Glushkov Automaton 

The Glushkov algorithm works on a linearized expression $E'$ deduced from $E$ by
ranking every symbol occurrence with its position in $E$. For example: if $E =
a(b + a)^* + a$ then $E' = a_1(b_2 + a_3)^* + a_4$.
The set of positions of $E$ is denoted $Pos(E)$. The application $\chi : Pos(E) \to \Sigma$
maps every position to its value in $\Sigma$.

Algorithm Glushkov(E)

1. Linearize the expression $E$. The result is the expression $E'$.

2. Compute the following sets:

- $Null_E$, which is $\{\varepsilon\}$ if $\varepsilon \in L(E)$ and $\emptyset$ otherwise.

- $First(E)$, the set of positions that match the first symbol of some
word in $L(E')$.

- $Last(E)$, the set of positions that match the last symbol of some
word in $L(E')$.

- $Follow(E, x)$, $\forall x \in Pos(E)$: the set of positions that follow the
position $x$ in some word of $L(E')$.

3. Compute the Glushkov automaton of $E$, $\mathcal{M}_E = (Q, \Sigma, \delta, s_I, F)$ where:

- $Q = Pos(E) \cup \{s_I\}$

- $\forall a \in \Sigma$, $\delta(s_I, a) = \{x \in First(E) \mid \chi(x) = a\}$

- $\forall x \in Q$, $\forall a \in \Sigma$, $\delta(x, a) = \{y \mid y \in Follow(E, x) \text{ and } \chi(y) = a\}$

- $F = Last(E) \cup Null_E \cdot \{s_I\}$
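The three steps above can be sketched with a single recursive pass over the syntax tree. This is a plain quadratic-style illustration, not the ZPC-based construction of the next subsection. It assumes expressions are encoded as nested tuples `('sym', a)`, `('alt', e, f)`, `('cat', e, f)` and `('star', e)`, and that state `0` plays the role of $s_I$; the function name `glushkov` is hypothetical.

```python
def glushkov(expr):
    """Return (delta, final, chi) for the Glushkov automaton of `expr`.
    Positions are numbered 1..n in left-to-right order; 0 is the initial state."""
    chi = {}      # position -> symbol (the map chi)
    follow = {}   # position -> set of positions (Follow)

    def visit(e):
        """Return (nullable, First, Last) of the subexpression e."""
        kind = e[0]
        if kind == 'sym':
            x = len(chi) + 1
            chi[x] = e[1]
            follow[x] = set()
            return False, {x}, {x}
        if kind == 'star':
            _, first, last = visit(e[1])
            for x in last:
                follow[x] |= first
            return True, first, last
        n1, f1, l1 = visit(e[1])
        n2, f2, l2 = visit(e[2])
        if kind == 'alt':
            return n1 or n2, f1 | f2, l1 | l2
        for x in l1:  # concatenation: Last(e1) is followed by First(e2)
            follow[x] |= f2
        return (n1 and n2,
                (f1 | f2) if n1 else f1,
                (l1 | l2) if n2 else l2)

    nullable, first, last = visit(expr)
    delta = {0: {}}
    for y in first:
        delta[0].setdefault(chi[y], set()).add(y)
    for x in chi:
        delta[x] = {}
        for y in follow[x]:
            delta[x].setdefault(chi[y], set()).add(y)
    final = set(last) | ({0} if nullable else set())
    return delta, final, chi

# E = a(b + a)* + a, linearized as a1(b2 + a3)* + a4.
E = ('alt', ('cat', ('sym', 'a'), ('star', ('alt', ('sym', 'b'), ('sym', 'a')))),
     ('sym', 'a'))
delta, final, chi = glushkov(E)
print(delta, final)
```

Every transition entering a position $y$ is labelled $\chi(y)$, which is the homogeneity property exploited in Section 4.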



2.2 The ZPC Structure 

The ZPC structure of a regular expression E is based on two forests deduced 
from its syntax tree T{E). These forests respectively encode the Last sets and 
the First sets associated to the subexpressions of E. The transition relation of 
the Glushkov automaton of E naturally appears as a collection of links from the 
Last forest to the First forest. 

Let us sketch the construction of the ZPC structure. 

Algorithm ZPC(E)

1. Compute the syntax tree T{E). 

2. Compute the forests TL{E) and TF{E). 

3. Compute the set of follow links going from TL{E) to TF{E). 

4. Remove redundant follow links. 



The Lasts forest $TL(E)$ is a copy of $T(E)$, where a link going from a node
labeled '·' to its left child is deleted if the language of its right child does not
contain $\varepsilon$. Thus the property $Last(F \cdot G) = Last(G) \cup Null_G \cdot Last(F)$ is
satisfied. Furthermore, each node of $TL(E)$ points to its leftmost and rightmost
leaves, and leaves in the same Last set are linked.

The Firsts forest $TF(E)$ is computed in a similar way, by deleting a link going
from a node labeled '·' to its right child if the language of its left child does not
contain $\varepsilon$, w.r.t. the property $First(F \cdot G) = First(F) \cup Null_F \cdot First(G)$.

The two forests are connected as follows. If a node of $TL(E)$ is labeled by '·',
its left child is linked to the right child of the corresponding node in $TF(E)$. If
a node is labeled by '*', its child is linked to the child of the corresponding node
in $TF(E)$. Such links are called follow links.

Notice that a follow link encodes the cartesian product of a Last set by a
First set, and that the transition relation $\delta$ is a union of such cartesian products.
Two products are either disjoint, or included in each other. Redundant products
are eliminated by a recursive procedure. Finally, the representation ZPC(E) is
such that each transition is encoded in a unique follow link.

Example 1. Let us take the expression $E = a(b + a)^* + a$. The linearized expression
is $E' = a_1(b_2 + a_3)^* + a_4$. So we can build the ZPC-representation as shown in
Fig. 1.

Fig. 1. ZPC(E) for $E' = a_1(b_2 + a_3)^* + a_4$.



It is convenient to process the expression $E' = \$(a_1(b_2 + a_3)^* + a_4)\#$, where \$
and \# are two distinguished positions. The position \$ is associated to the initial
state of $\mathcal{M}_E$ and is involved in the follow link $\{\$\} \times First(E)$. The position \#
is reached from positions which belong to $Last(E)$ by scanning the end of the
input word; it appears only in the follow link $Last(E) \times \{\#\}$. Notice that \$ is
involved in this link too if $\$ \in Last(E)$, i.e. if $L(E)$ contains $\varepsilon$.



3 The Subset Construction 

In this section we recall the subset construction. 

Let $\mathcal{M} = (Q, \Sigma, \delta, I, F)$ be an arbitrary nondeterministic automaton. Let
$n = |Q|$. The subset construction computes a deterministic automaton
$\mathcal{D} = (Q', \Sigma, \delta', \{0'\}, F')$ which recognizes the same language as $\mathcal{M}$. In order to
deduce $\mathcal{D}$ from $\mathcal{M}$, we consider the application $\beta$ which maps the states of $\mathcal{D}$
into the subsets of states of $\mathcal{M}$.

1. Let $\beta : Q' \to 2^Q$ be a map such that:

(a) $\beta(0') = I$

(b) $\forall q' \in Q'$, $\forall l \in \Sigma$, the state $q'_l = \delta'(q', l)$ is such that:

$\beta(q'_l) = \bigcup_{q \in \beta(q')} \delta(q, l)$

(c) $\beta(q') = \emptyset \Rightarrow q' = s'$, where $s'$ is a unique sink state of $\mathcal{D}$.

2. $q' \in F' \Leftrightarrow \beta(q') \cap F \ne \emptyset$

Notice that subset construction leads to two classical implementations: 

1. The exhaustive algorithm first computes the transitions of all the $2^n - 1$
possible states and then trims the automaton.

2. The reachable algorithm only computes the transitions of the reachable states
of $\mathcal{D}$. The difficulty is to decide whether the state $q'_l = \delta'(q', l)$ is a new
state. This state identification test is generally based on comparisons between
the set $\beta(q'_l)$ associated with $q'_l$ and the sets $\beta(p')$ associated with states $p'$
already in $\mathcal{D}$.
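The reachable algorithm can be sketched as a worklist loop in which the state identification test is a dictionary lookup on frozen sets. This is a generic illustration that replaces explicit set comparisons by hashing; it is not the ZPC-based method of Section 4. The NFA is assumed to be given as `delta[state][symbol] -> set of states`, and the name `determinize` is hypothetical.

```python
from collections import deque

def determinize(delta, initial, final, alphabet):
    """Reachable subset construction.
    Returns (ddelta, dfinal, beta_inv) where beta_inv maps each subset
    beta(q') to the integer name of q'. The empty subset is the sink state."""
    start = frozenset(initial)
    names = {start: 0}
    ddelta = {}
    queue = deque([start])
    while queue:
        subset = queue.popleft()
        row = {}
        for a in alphabet:
            target = frozenset(
                t for s in subset for t in delta.get(s, {}).get(a, ())
            )
            if target not in names:  # state identification test
                names[target] = len(names)
                queue.append(target)
            row[a] = names[target]
        ddelta[names[subset]] = row
    dfinal = {i for subset, i in names.items() if subset & set(final)}
    return ddelta, dfinal, names

# Glushkov automaton of a1(b2 + a3)* + a4, with initial state 0.
nfa = {0: {'a': {1, 4}}, 1: {'b': {2}, 'a': {3}},
       2: {'b': {2}, 'a': {3}}, 3: {'b': {2}, 'a': {3}}, 4: {}}
ddelta, dfinal, names = determinize(nfa, {0}, {1, 2, 3, 4}, 'ab')
print(names, dfinal)
```

On this input the loop produces the subsets {0}, {1,4}, {3} and {2} plus the empty sink subset.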

4 Determinization of Glushkov Automata 

Glushkov automata have some nice properties 0, as do their ZPC-representations.
We take advantage of these properties to improve the efficiency of
subset construction applied to Glushkov automata.

Let us consider a regular expression $E$, and let $\|E\|$ be its alphabetic width
and $\mathcal{M}_E = (Q, \Sigma, \delta, \{0\}, F)$ be its Glushkov automaton. We have $|Q| = n =
\|E\| + 1$. Let $\mathcal{D}_E = (Q', \Sigma, \delta', \{0'\}, F')$ be the result of the determinization of
$\mathcal{M}_E$ with $|Q'| = n'$. For each $l \in \Sigma$, we denote by $n_l$ the number of occurrences
of $l$ in $E$ and by $Q_l$ the set of states $q$ such that $\chi(q) = l$.

As far as Glushkov automata are concerned, we can reduce the complexity 
of subset construction as we now establish. 



4.1 Bounding the Number of States in the Deterministic 
Automaton 

Proposition 1. The number $n'$ of states of the deterministic automaton $\mathcal{D}_E$
yielded by the determinization of a Glushkov automaton satisfies

$n' \le \Big(\sum_{a \in \Sigma} 2^{n_a}\Big) - |\Sigma| + 1.$
Proof. This follows from the homogeneity of a Glushkov automaton; that is,

$\forall q, p \in Q, \forall a, b \in \Sigma : \delta(q, a) = \delta(p, b) \Rightarrow a = b.$ (1)

Let us consider a subset $P \subseteq Q$ ($P \ne \{0\}$) produced by the subset
construction. There exists a subset $S$ of $Q$ and a letter $a$ of $\Sigma$ such that:

$P = \bigcup_{s \in S} \delta(s, a).$

Using (1) we have $P \subseteq \{q \mid q \in Q \text{ and } \chi(q) = a\} = Q_a$, which shows that only
subsets of the $Q_a$'s have to be considered.

Remark 1. This bound is generally much smaller than the general bound $2^n - 1$.
The gap increases with the size of the alphabet, which is generally large in
linguistic applications.



Example 2. Consider the expression

$E = (a_1 + (a_2 + b_3)^* a_4)(a_5 + b_6)^*.$

Fig. 2. a. The nondeterministic automaton $\mathcal{M}_E$. b. The deterministic automaton
$\mathcal{D}_E$ of $\mathcal{M}_E$.

$\mathcal{M}_E$ and $\mathcal{D}_E$ are given by Figure 2. The number of states of $\mathcal{D}_E$ is at most
127 by the classical worst case bound and is at most 19 by Proposition 1.
Example 3. If we consider the expression

$E = (a + b)^*(babab(a + b)^*bab + bba(a + b)^*bab)(a + b)^*$

given by Antimirov [2], the classical bound is 8.3 million and our bound is 8.7
thousand.

A practical interest of our bound is to help in deciding whether an exhaustive 
implementation is practicable or not. 
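Both bounds depend only on how often each letter occurs in the expression, so they are easy to evaluate before choosing an implementation. The sketch below takes the sequence of symbol occurrences of $E$ in reading order; the function names are hypothetical. Letters of $\Sigma$ that do not occur in $E$ contribute $2^0 - 1 = 0$ to Proposition 1 and can therefore be ignored.

```python
from collections import Counter

def glushkov_dfa_bound(occurrences):
    """Bound of Proposition 1: sum over letters a of 2**n_a, minus |Sigma|, plus 1."""
    counts = Counter(occurrences)
    return sum(2 ** n for n in counts.values()) - len(counts) + 1

def classical_bound(occurrences):
    """Classical bound 2**n - 1 with n = ||E|| + 1 states in the Glushkov automaton."""
    return 2 ** (len(occurrences) + 1) - 1

# Example 2: E = (a1 + (a2 + b3)* a4)(a5 + b6)*
ex2 = "aabaab"
# Example 3: (a+b)*(babab(a+b)*bab + bba(a+b)*bab)(a+b)*
ex3 = "ab" + "babab" + "ab" + "bab" + "bba" + "ab" + "bab" + "ab"

print(classical_bound(ex2), glushkov_dfa_bound(ex2))
print(classical_bound(ex3), glushkov_dfa_bound(ex3))
```

For Example 3 the expression has 9 occurrences of $a$ and 13 of $b$, which gives the figures of roughly 8.3 million and 8.7 thousand quoted above.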

4.2 Computing the Transitions of the Deterministic Automaton 

We now deal with the computation of the transitions of $\mathcal{D}_E$. Let $q'$ be a state of
$\mathcal{D}_E$ and $l$ a letter of $\Sigma$. Let $q'_l = \delta'(q', l)$. We consider the set $\gamma(q') = \biguplus_{l \in \Sigma} \beta(q'_l)$.

As a Glushkov automaton is homogeneous, we have:

$\forall q, r \in \beta(q'), \chi(q) = \chi(r),$

for all states $q'$ in $Q'$. So, for $l \in \Sigma$, the subsets $\beta(q'_l)$ associated to states $q'_l$ are
pairwise disjoint. Hence the set $\gamma(q')$ can be computed as a disjoint union.

We now show that the set $\gamma(q')$ can be computed in linear time, if the ZPC
structure is used judiciously. We have

$\gamma(q') = \biguplus_{l \in \Sigma} \bigcup_{s \in \beta(q')} \delta(s, l) = \biguplus_{l \in \Sigma} \delta(\beta(q'), l).$ (2)

Let us give an efficient procedure which computes $Y = \biguplus_{l \in \Sigma} \delta(X, l)$, where

X C Q. This procedure is detailed in the paper m which describes an efficient 
test for regular language membership. We briefly recall it here. It is based on 
three steps: 

Step 1: Let us compute the set $\Lambda$ of nodes of the forest $TL(E)$ which are at
the same time ancestors of at least one $x \in X$, and heads of follow links. This is
done by inspecting the path going from any position $x$ of $X$ to the root of the
tree $x$ belongs to.

Step 2: Let $\Phi$ be the set of tails of follow links associated to $\Lambda$.

Step 3: Given such a set $\Phi$, $Y$ is the set of positions of $TF(E)$ which are
descendants of at least one element of $\Phi$. $Y$ is derived from the set $\Phi'$ of nodes of
$TF(E)$ which are tails of follow links, such that:

$Y = \biguplus_{\varphi \in \Phi'} First(\varphi)$

Example 4- 

step 1 A = {1,3,4} 
step 2 A = |Ai, As, A 7 } 

step 3 <? = ip 7 } ^ <?' = {v5ii, v^ 6 , ~^Y = {2, 3, #} 



Fig. 3. ZPC(E) for $E' = a_1(b_2 + a_3)^* + a_4$.

Proposition 2 ([14]) Let $X \subseteq Q$ and $l \in \Sigma$. Then, the set $\biguplus_{l \in \Sigma} \delta(X, l)$
can be computed in time $O(n)$.

Proposition 3. For all $q' \in Q'$, $\gamma(q')$ can be computed in time $O(n)$.

Proof. From Formula (2) and Proposition 2.



4.3 Testing Set Equalities 

The aim now is to reduce the number of integer comparisons involved by the test
"Is $q'_l$ a new state?". We first show that we can use sets $\gamma\gamma(q')$ instead
of sets $\gamma(q')$, with $|\gamma\gamma(q')| \le |\gamma(q')|$, in order to improve the time complexity
of the set equality test. Then we shall describe a linear time algorithm which
computes the sets $\gamma\gamma(q')$. Let us explain how these reduced sets derive from $\gamma(q')$
sets.

Let us consider the set $\Phi'_{q'}$ which is such that $\gamma(q') = \biguplus_{\varphi \in \Phi'_{q'}} First(\varphi)$.

The number of First sets which occur in this disjoint union is necessarily
at most the number of positions in $\gamma(q')$. Therefore it seems conceivable to
characterize the state $q'$ by the set $\Phi'_{q'}$ and the state $q'_l$ by the set $\{\varphi \mid \varphi \in \Phi'_{q'}$
and $First(\varphi) \cap Q_l \ne \emptyset\}$. This last set is smaller than $\beta(q'_l)$. However the set $\Phi'_{q'}$
is not a good candidate since for two distinct states $q'$ and $q''$, one can have both
$\gamma(q') = \gamma(q'')$ and $\Phi'_{q'} \ne \Phi'_{q''}$. This is shown by the following example.

Example 5. Consider Figure 3 of Example 4. After determinization, we get:

$0' \to \{0\}$      $\gamma(0') = \{1, 4\}$      $\beta(0'_a) = \{1, 4\}$   $\beta(0'_b) = \emptyset$
$1' \to \{1, 4\}$   $\gamma(1') = \{2, 3, \#\}$   $\beta(1'_a) = \{3\}$      $\beta(1'_b) = \{2\}$
$2' \to \{3\}$      $\gamma(2') = \{2, 3, \#\}$
$3' \to \{2\}$      $\gamma(3') = \{2, 3, \#\}$

We have: $\gamma(1') = \gamma(2') = \{2, 3, \#\}$, and $\Phi'_{1'} = \{\varphi_6, \varphi_{11}\} \ne \Phi'_{2'}$.

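Let us consider the following property:

$\varphi \in TF(E)$, $\mathcal{P}(\varphi)$: $\forall x \in Pos(\varphi), \exists \varphi' \in \Phi'_{q'} \mid x \in First(\varphi')$.

By convention, $\mathcal{P}(father(r))$ is false if $r$ is the root of a tree of $TF(E)$.
We now consider sets $\gamma\gamma(q')$ such that:

$\forall \varphi \in \gamma\gamma(q')$, $\mathcal{P}(\varphi) \wedge \neg\mathcal{P}(father(\varphi))$.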




Fig. 4. Computation of $\gamma\gamma(q')$.



Example 6. Let us assume that: We have: 77(9') = {po\ since 

1. Pos{pe) = First{ps) U Pirst{pu) 

2. p4 = father{po) is such that ai belongs to Pos{pa) while oi neither belongs 
to Pos{ps) nor to Pos{pu). 

Such a set $\gamma\gamma(q')$ is uniquely defined, since every position in $\biguplus_{\varphi \in \Phi'_{q'}} First(\varphi)$ is represented by a unique ancestor, the closest one to the root. Hence the following proposition.

Proposition 4. If $q' \neq q''$ and $\gamma(q') \neq \gamma(q'')$, then $\gamma\gamma(q') \neq \gamma\gamma(q'')$.

Let us remark that since every state $p'$ ($p' \neq 0'$) of $Q'$ is generated by the action of a letter $l$ on a state $q'$ already generated, every state $p' = q'_l$ of $Q'$ can be characterized by the set

$\beta\beta(q'_l) = \{\varphi \mid \varphi \in \gamma\gamma(q')$ and $First(\varphi) \cap Q_l \neq \emptyset\}$.

As $|\gamma\gamma(q')| \leq |\gamma(q')|$, we have $|\beta\beta(q'_l)| \leq |\beta(q'_l)|$, which speeds up the set equality tests.
procedure create-γγ(Φ', TF(E), γγ)
/* Φ' and TF(E) are input parameters */
/* γγ is an output parameter */
/* d is a local boolean array */

function dans(φ, in)
/* φ is a node of T(E) used as input */
/* in is a boolean array used as input/output */
begin
    if (in[φ] = false) or (φ ∉ Pos(E)) then
        switch φ
        begin
            case : in[φ] ← (dans(left(φ), in) ∧ dans(right(φ), in))
            case : in[φ] ← dans(child(φ), in)
        end
    fi
    return in[φ]
end

procedure traversal(φ, d, γγ)
/* φ is a node of T(E) used as input */
/* γγ is an output parameter */
begin
    if d[φ]
        then γγ ← γγ ∪ {φ}
        else if φ ∉ Pos(E)
            then
            begin
                traversal(left(φ), d, γγ)
                traversal(right(φ), d, γγ)
            end
        fi
end

begin
    foreach φ ∈ TF(E) do d[φ] ← false
    foreach φ ∈ Φ' do
        if father(φ) = then d[φ] ← true
    od
    dans(rac(T(E)), d)
    traversal(rac(T(E)), d, γγ)
end



Fig. 5. Algorithm for computation of sets $\gamma\gamma$.



We finally describe a linear time algorithm (see Figure 5) which computes the sets $\gamma\gamma(q')$.

The procedure create-γγ generates the set $\gamma\gamma(q')$. It calls the function dans

and the procedure traversal. The function dans constructs an array in such 
that in[φ] = true if and only if G llW)- The procedure traversal collects the nodes of $\gamma\gamma(q')$ through a recursive traversal of $T(E)$.

5 Conclusion 

The representation ZPC(E) of the Glushkov automaton of a regular expression 
E, which can be computed in linear time and space, encodes a natural partition 
of the set of transitions. Making use of this partition speeds up the computation 
of the transitions and of the set equality tests in the subset construction when 
applied to Glushkov automata. 
Abstract. In [17], we introduced a bit-wise representation of r-AFA, which greatly improved the space efficiency in representing regular languages. We also described our algorithms and implementation methods for the union, intersection, and complementation of r-AFA. However, our direct algorithms for the star, concatenation, and reversal operations of r-AFA would cause an exponential expansion in the size of the resulting r-AFA even in the average case. In this paper, we design new algorithms for the star, concatenation, and reversal operations of r-AFA based on the bit-wise representation introduced in [17]. Experiments show that the new algorithms can significantly reduce the state size of the resulting r-AFA. We also show how we have improved the DFA-to-AFA transformation algorithm which was described in [17]. The average run time of this transformation using the modified algorithm has improved significantly (by 97 percent).



1 Introduction 

The study of finite automata was motivated largely by the study of control 
circuits and computer hardware in the fifties and early sixties. Implementation 
of finite automata was mainly a hardware issue then. 

Since the mid and late sixties, finite automata have been widely used in 
lexical analysis, string matching, etc. They have been implemented in software 
rather than hardware. However, the sizes of the automata in those applications 
are small in general. 

Recently, finite automata and their variants have been used in many new software applications. Examples are statecharts in object-oriented modeling and design, weighted automata in image compression, and synchronization expressions and languages in concurrent programming languages. Many of those new applications require automata with a very large number of states. For example, concatenation is a required operation in most of those applications. Consider the concatenation of two deterministic finite automata (DFA) with 10 states and 20 states, respectively. The resulting DFA may contain about 20 million states in the worst case.

* This research is supported by the Natural Sciences and Engineering Research Council of Canada grants OGP0041630.

It is clear that implementing finite automata in hardware is different from implementing them in software. Hardware implementation is efficient, but suitable only for predefined automata. Adopting hardware implementation methods in software implementations is not straightforward. At least the following problems have to be solved: (1) How do we represent and store a combinational network in a program efficiently in both space and time? (A table would be too big.) (2) A basic access unit in a program is a word. The access is parallel within a word and sequential among words. An implementation algorithm that does not consider the word boundary and make use of it would not be efficient.

Implementing a small finite automaton is also different from implementing a very large finite automaton. A small finite automaton can be implemented by a word-based table or even a case or switch statement. However, these methods would not be suitable for implementing a DFA of 20 million states.

In [17], we introduced a bit-wise representation of r-AFA, which greatly improved the space efficiency in representing regular languages. We also described our algorithms and implementation methods for the union, intersection, and complementation of r-AFA.

It has been shown that a language $L$ is accepted by an $n$-state DFA if and only if the reversal of $L$, i.e., $L^R$, is accepted by a $\log n$-state AFA. So, the use of r-AFA (reversed AFA) instead of DFA guarantees a logarithmic reduction in the number of states. However, the boolean expressions that are associated with each state can be of exponential size in the number of states. In our previous paper [17], we introduced a bit-wise representation for r-AFA and described the 
transformations between DFA and r-AFA, and also the algorithms for the union, 
intersection, and complementation for r-AFA. The model of r-AFA is naturally 
suited for bit-wise representations. NFA and DFA could also be represented in 
certain bit-wise forms which would save space. However, their operations would 
be awkward and extremely inefficient. Our experiments have shown that the 
use of r-AFA instead of DFA for implementing regular languages can significantly 
improve, on average, both the space efficiency and the time efficiency for the 
union, intersection, and complementation operations. 

We know that the resulting DFAs of the reversal and star operations of an $n$-state DFA have $2^n$ and $2^{n-1} + 2^{n-2}$ states, respectively, in the worst case, and the result of a concatenation of an $m$-state DFA and an $n$-state DFA is an $(m2^n - 2^{n-1})$-state DFA in the worst case. Therefore, in the worst case, 
the resulting r-AFA of the corresponding operations of r-AFA have basically 
the same state complexities, respectively. Direct constructions for the reversal, 
star, and concatenation of r-AFA, as described for AFA in IHCl, would have an 
exponential expansion in the number of states for each of the above mentioned 
operations. 

In this paper, we present our new algorithms for the reversal, star, and con- 
catenation operations of r-AFA. These algorithms simplify the r-AFA during the 



Implementing Reversed Alternating Finite Automaton (r-AFA) Operations 






operations. They do not necessarily produce a minimum r-AFA, but they reduce 
the number of states tremendously in the average case. 

Our experiments show that the algorithms reduce not only the size of the 
state set but also the total size of a resulting r-AFA in the average case. 

At the end, we also show how we have improved the DFA-to-AFA transformation algorithm described in [17]. The average run time of this transformation using the improved algorithm has been reduced significantly (by 97 percent), and for the same input DFA, the resulting AFA of the original and the improved algorithms are the same up to a permutation of states.

2 Preliminaries 

The concept of alternating finite automata (AFA) was introduced in |5| and 0 at 
the same period of time under different names. A more detailed treatment of AFA 
operations can be found in |^. In the paper [17], we modified the definition of 
AFA and introduced h-AFA and r-AFA. The notion of “reversed” AFA, r-AFA, 
is considered also in |2H but the definition there differs in a minor technical detail 
from the definition used here and in H2|. Below we briefly recall the definition 
of AFA. For a more detailed exposition and examples the reader is referred to 
m- Background on finite automata in general can be found in PI- 

We denote by $B$ the two-element Boolean algebra, and $B^Q$ stands for the set of all functions from the set $Q$ to $B$.

An h-AFA $A$ is a quintuple $(Q, \Sigma, g, h, F)$, where $Q$ is the finite set of states, $\Sigma$ is the input alphabet,

$g : Q \times \Sigma \times B^Q \to B$

is the transition function,

$h : B^Q \to B$

is the accepting Boolean function, and $F \subseteq Q$ is the set of final states.

We use $g_q : \Sigma \times B^Q \to B$ to denote that $g$ is restricted to state $q$, i.e., $g_q(a, u) = g(q, a, u)$, for $a \in \Sigma$, $u \in B^Q$, $q \in Q$.

We also use $g_Q$ to denote the function from $\Sigma \times B^Q$ to $B^Q$ that is obtained by combining the functions $g_q$, $q \in Q$, i.e.,

$g_Q = (g_q)_{q \in Q}$.

We will write $g$ instead of $g_Q$ whenever there is no confusion.

Let $u \in B^Q$. We use $u(q)$ or $u_q$, $q \in Q$, to denote the component of the vector $u$ indexed by $q$. By $0$ we denote the constant zero-vector in $B^Q$ (when $Q$ is understood from the context).

We extend the definition of $g$ of an h-AFA to a function $Q \times \Sigma^* \times B^Q \to B$ as follows:

$g(q, \lambda, u) = u_q$
$g(q, a\omega, u) = g(q, a, g(\omega, u))$
for all $q \in Q$, $a \in \Sigma$, $\omega \in \Sigma^*$, $u \in B^Q$.

Similarly, the function $g_Q$, or simply $g$, is extended to a function $\Sigma^* \times B^Q \to B^Q$.

Given an h-AFA $A = (Q, \Sigma, g, h, F)$, for $w \in \Sigma^*$, $w$ is accepted by $A$ if and only if $h(g(w, f)) = 1$, where $f$ is the characteristic vector of $F$, i.e., $f_q = 1$ iff $q \in F$.

An r-AFA $A$ is an h-AFA such that for each $w \in \Sigma^*$, $w$ is accepted by $A$ if and only if $h(g(w^R, f)) = 1$, where $f$ is the characteristic vector of $F$.

The transition functions of h-AFA and r-AFA are denoted by Boolean functions. Every Boolean function can be written as a disjunction of conjunctions of Boolean variables. We call a conjunction of Boolean variables or a constant Boolean function a term. A term $x_1 \wedge \ldots \wedge x_k$ is denoted simply as $x_1 \cdots x_k$. The negation of a variable $x$ is denoted as $\bar{x}$.

The following result, which was proved in [17], states that any Boolean term can be represented by two bit-wise vectors.

Theorem 1 For any Boolean function $f$ of $n$ variables that can be expressed as a single term, there exist two $n$-bit vectors $\alpha$ and $\beta$ such that

$f(u) = 1 \iff (\alpha \,\&\, u) \uparrow \beta = 0$, for all $u \in B^n$,

where $\&$ is the bit-wise AND operator, $\uparrow$ the bit-wise EXCLUSIVE-OR operator, and $0$ is the zero vector $(0, \ldots, 0)$ in $B^n$.
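Theorem 1 can be illustrated with a minimal Python sketch. It assumes one natural choice of the two vectors, which the text does not spell out: $\alpha$ marks the variables that occur in the term and $\beta$ records which of them occur un-negated. The function names `encode_term` and `eval_term` are hypothetical and only serve the illustration.

```python
def encode_term(literals, n):
    """Encode a conjunction of literals over n variables as (alpha, beta).

    `literals` maps a 0-based variable index to the value the term requires:
    1 for the plain variable, 0 for its negation (assumed encoding).
    """
    alpha = 0  # bit i set: variable i occurs in the term
    beta = 0   # bit i set: variable i occurs un-negated
    for i, value in literals.items():
        if not 0 <= i < n:
            raise ValueError("variable index out of range")
        alpha |= 1 << i
        if value:
            beta |= 1 << i
    return alpha, beta


def eval_term(alpha, beta, u):
    """Return 1 iff (alpha & u) XOR beta is the zero vector."""
    return int(((alpha & u) ^ beta) == 0)


# The term x0 AND (NOT x2) over three variables.
alpha, beta = encode_term({0: 1, 2: 0}, 3)
print(eval_term(alpha, beta, 0b011))  # x0 = 1, x2 = 0
print(eval_term(alpha, beta, 0b101))  # x0 = 1, x2 = 1
```

Under this encoding the constant function 1 is the pair (0, 0), and the constant function 0 can be written with $\alpha = 0$ and any non-zero $\beta$.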

Each $n$-bit vector $v = (v_1, \ldots, v_n)$, $n \leq 32$ (32-bit is the normal size of a word), can be represented as an integer

$I_v = \sum_{i=1}^{n} v_i 2^{i-1}$.

We can also transform an integer $I_v$ back to a 32-bit vector $v$ in the following way:

$v_i = (I_v \,\&\, 2^{i-1}) / 2^{i-1}$, $1 \leq i \leq n$.

So, a Boolean function, which is in disjunctive normal form, can be represented as a list of terms, while each term is represented by two integers.

For an r-AFA $A = (Q, \Sigma, g, h, F)$, where $Q = \{q_1, \ldots, q_n\}$ and $\Sigma = \{a_1, \ldots, a_m\}$, we can represent $g$ as a table of functions of size $n \times m$ with $(i, j)$ entry corresponding to the function $g_{q_i}(a_j) : B^Q \to B$ defined by $[g_{q_i}(a_j)](u) = g_{q_i}(a_j, u)$, for $q_i \in Q$, $a_j \in \Sigma$, and $u \in B^Q$. The accepting function $h$ can be represented as a list of integer pairs. Finally, $F$ can be represented as a bit-vector (an integer), i.e., as its characteristic vector.
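The representation just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the table `g` is a list (one row per state) of dictionaries from letters to lists of `(alpha, beta)` pairs, and the two-state example automaton is invented for the sketch. Since $g(a\omega, u) = g(a, g(\omega, u))$, evaluating $g(w^R, f)$ amounts to reading $w$ from left to right, starting from $f$.

```python
def eval_dnf(terms, u):
    """Evaluate a DNF given as a list of (alpha, beta) integer pairs."""
    return int(any(((alpha & u) ^ beta) == 0 for alpha, beta in terms))


def step(g, a, u):
    """Apply the combined transition function g_Q(a, .) to the vector u."""
    v = 0
    for i, row in enumerate(g):
        if eval_dnf(row[a], u):
            v |= 1 << i
    return v


def accepts(g, h, f, word):
    """Return True iff h(g(w^R, f)) = 1 for the r-AFA (g, h, f)."""
    u = f
    for a in word:  # reading w left to right computes g(w^R, f)
        u = step(g, a, u)
    return bool(eval_dnf(h, u))


# Hypothetical example: bit 0 tracks the parity of a's, bit 1 records a seen b.
X0, NOT_X0, X1, TRUE = (0b01, 0b01), (0b01, 0b00), (0b10, 0b10), (0, 0)
g = [
    {"a": [NOT_X0], "b": [X0]},   # state 0
    {"a": [X1], "b": [TRUE]},     # state 1
]
h = [(0b11, 0b10)]  # NOT x0 AND x1: even number of a's and at least one b
f = 0               # F is empty

print(accepts(g, h, f, "aab"))
print(accepts(g, h, f, "ab"))
```

Each transition step costs one AND, one XOR and one comparison per term, which is the point of storing terms as integer pairs.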

3 Algorithms for Operations of r-AFA 

In this section, we present the algorithms for constructing the star and reversal 
of a given r-AFA and the concatenation of two given r-AFA. Note that after each 
construction, a simplification algorithm is applied to each Boolean function in 
order to reduce the size of the Boolean expression. An algorithm which we have 
implemented can be found in m- We omit the proofs of the correctness of the 
algorithms due to the limit on the size of the paper. 



3.1 The Star Operation 

First, we consider the star operation. The algorithm eliminates all the useless 
states during the construction. Our experiments suggest that the r-AFA resulting 
from this method are significantly smaller than the naive algorithm described in 

0 

Let $A = (Q, \Sigma, g, h, F)$ be a given r-AFA with $Q = \{q_1, \ldots, q_n\}$. We construct an r-AFA $A' = (Q', \Sigma, g', h', F')$ such that $L(A') = L(A)^*$ in the following. Let $f$ be the characteristic vector of $F$.

Now we describe the algorithm, which deletes all the useless states during the construction.

If $n = 0$ and $h = 1$, then the r-AFA $A$ accepts $\Sigma^*$. So, $L(A') = (L(A))^*$ is the same as $L(A)$ and we can just let $A' = A$. If $n = 0$ and $h = 0$, then $A$ accepts the empty language and $A'$ accepts the language that contains only the empty word $\lambda$. Then $A'$ is the one-state r-AFA where $Q' = \{0\}$, $F' = \emptyset$, $h'(x_0) = \bar{x}_0$, and $g'(a, x_0) = 1$ for all $a \in \Sigma$.

Now we assume that $n$ is a positive integer.

$F' = \emptyset$.

Let $I$ be an array of integers of size $2^n$.

Initialize $I[k] = 0$, for $k = 0, \ldots, 2^n - 1$.

Use the procedure Markarray($I$, $A$) described at the end of this subsection to mark $I$, so that

$I[k] = 1 \iff \exists x \in \Sigma^*$ s.t. $g(x, f) = k$, for $k \neq f$;

$I[f] = 1 \iff \exists x \in \Sigma^*$ s.t. $h(g(x, f)) = 1$.

If $I[f] = 0$, that means that $A$ accepts nothing. We can construct $A'$ the same way as in the case $n = 0$. Otherwise go to the next steps.

Let $I[k_0], \ldots, I[k_p]$ be all the entries of $I$ that have a value 1. Let $P$ be an array of integers of size $p + 1$, such that $P[i] = k_i$.

Set $Q' = \{0, \ldots, p\}$.

Similarly as above we assume that the set $\{0, 1, \ldots, 2^n - 1\}$ is identified with $B^Q$. This means that, for all $i \in Q'$, the entry $P[i]$ will represent an element of $B^Q$.

Define the head function $h'$ as follows:

$h'(u) = 1 \iff u = 0$ or $\exists t \in Q'$ such that $u_t = 1$, $h(P[t]) = 1$.

Thus,

$h'((x_0, \ldots, x_p)) = \bar{x}_0 \cdots \bar{x}_p \vee \bigvee_{h(P[i]) = 1} x_i$.

For any $a \in \Sigma$ and $k$ such that $P[k] \neq f$, define $g'$ as:

$g'(k, a, u) = 1 \iff \exists s \in Q'$ such that $u_s = 1$, $g(a, P[s]) = P[k]$, if $u \neq 0$; and

$g'(k, a, 0) = 1 \iff g(a, f) = P[k]$.

That is,

$$g'(k, a, (x_0, \ldots, x_p)) = \begin{cases} \bigvee_{g(a, P[i]) = P[k]} x_i \vee \bar{x}_0 \cdots \bar{x}_p & \text{if } g(a, f) = P[k], \\ \bigvee_{g(a, P[i]) = P[k]} x_i & \text{otherwise.} \end{cases}$$

Assume $t$ is an index such that $P[t] = f$. Define $g'$ as:

$g'(t, a, u) = 1 \iff \exists s \in Q'$ such that $u_s = 1$, $g(a, P[s]) = f$, or $\exists r \in Q'$ such that $u_r = 1$, $h(g(a, P[r])) = 1$, if $u \neq 0$; and

$g'(t, a, 0) = 1 \iff g(a, f) = f$ or $h(g(a, f)) = 1$.

Thus,

$$g'(t, a, (x_0, \ldots, x_p)) = \begin{cases} \bigvee_{g(a, P[i]) = f \text{ or } h(g(a, P[i])) = 1} x_i \vee \bar{x}_0 \cdots \bar{x}_p & \text{if } h(g(a, f)) = 1 \text{ or } g(a, f) = f, \\ \bigvee_{g(a, P[i]) = f \text{ or } h(g(a, P[i])) = 1} x_i & \text{otherwise.} \end{cases}$$



The following procedure can be used to mark the array used in the above 
algorithm. 

Procedure Markarray(I, A)

Qu is a queue of integers. Initially, Qu has only one entry f.
if (h(f) == 1)
    I[f] = 1;
while (!Empty(Qu))
{
    int tmp = pop(Qu);
    int vector;
    for (a ∈ Σ)
    {
        vector = g(a, tmp);
        if (I[vector] == 0)
        {
            I[vector] = 1;
            push(vector, Qu);
        }
        if (I[f] == 0 && h(vector) == 1)
            I[f] = 1;
    }
}
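The marking procedure is a breadth-first search over the vectors reachable from $f$. The following Python sketch is one way to realize the stated specification ($I[k] = 1$ iff $k \neq f$ is reachable, and $I[f] = 1$ iff some reachable vector satisfies $h$). It makes two assumptions that are not in the pseudocode: `g` and `h` are passed as plain Python callables on integers, and the vector $f$ is explicitly excluded from the "reachable" marking so that $I[f]$ keeps its special meaning.

```python
from collections import deque


def mark_array(n, alphabet, g, h, f):
    """Mark the vectors reachable from f; I[f] records non-emptiness of L(A)."""
    I = [0] * (1 << n)
    if h(f):
        I[f] = 1
    queue = deque([f])
    while queue:
        tmp = queue.popleft()
        for a in alphabet:
            vector = g(a, tmp)
            if h(vector):
                I[f] = 1          # some word is accepted
            # Assumption: f itself is never marked merely for being reachable.
            if vector != f and I[vector] == 0:
                I[vector] = 1
                queue.append(vector)
    return I


# Hypothetical 2-state example: 'a' flips bit 0, 'b' sets bit 1.
g = lambda a, u: u ^ 1 if a == "a" else u | 2
print(mark_array(2, "ab", g, lambda u: u == 2, 0))
print(mark_array(2, "ab", g, lambda u: False, 0))
```

Each vector enters the queue at most once, so the search performs at most $2^n |\Sigma|$ transition evaluations.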
3.2 Concatenation of r-AFA

Let $A^{(i)} = (Q^{(i)}, \Sigma^{(i)}, g^{(i)}, h^{(i)}, F^{(i)})$, where $i = 1, 2$, be two r-AFAs. We construct an r-AFA $A = (Q, \Sigma, g, h, F)$ which accepts the concatenation of $L(A^{(1)})$ and $L(A^{(2)})$.

Let the numbers of states of $A^{(1)}$ and $A^{(2)}$ be $n$ and $m$, and let $f_i$ be the characteristic vector of $F^{(i)}$ for $i = 1, 2$, respectively. We construct $A$ in three cases as follows:

First, let us assume $mn \neq 0$ and $\Sigma^{(1)} = \Sigma^{(2)} = \Sigma$.

$Q = \{q_0, \ldots, q_{n-1}, q_n, \ldots, q_{n+2^m-1}\}$, where $Q^{(1)} = \{q_0, \ldots, q_{n-1}\}$ and $q_k$ are new states for $k \geq n$.

$$F = \begin{cases} F^{(1)} & \text{if } h^{(1)}(f_1) = 0, \\ F^{(1)} \cup \{q_{n+f_2}\} & \text{otherwise.} \end{cases}$$

We identify the numbers $n, \ldots, n + 2^m - 1$ with elements of $B^{Q^{(2)}}$, thus in the above definition of $F$ the notation $n + f_2$ stands for the number belonging to $\{n, \ldots, n + 2^m - 1\}$ that denotes $f_2$.

Define $h(u) = 1 \iff \exists x$, such that $0 \leq x \leq 2^m - 1$, $u_{x+n} = 1$, and $h^{(2)}(x) = 1$, for $u \in B^Q$; that is,

$h((x_0, \ldots, x_{n+2^m-1})) = \bigvee_{h^{(2)}(i) = 1} x_{n+i}$.

We define $g(a, u)|_{Q^{(1)}} = g^{(1)}(a, u|_{Q^{(1)}})$, for $u \in B^Q$ and $a \in \Sigma$. That is, $g(q_i, a, (x_0, \ldots, x_{n+2^m-1})) = g^{(1)}(q_i, a, (x_0, \ldots, x_{n-1}))$ ($\forall i < n$).

Also define $g(a, u)_{q_x} = 1 \iff \exists y$, such that $0 \leq y \leq 2^m - 1$ and $u_{y+n} = 1$, $g^{(2)}(a, y) = x - n$, for $x \geq n$, $x \neq n + f_2$, $u \in B^Q$ and $a \in \Sigma$; that is,

$g(q_i, a, (x_0, \ldots, x_{n+2^m-1})) = \bigvee_{g^{(2)}(a, k-n) = i-n} x_k$.

Define $g(a, u)_{q_{n+f_2}} = 1 \iff h^{(1)}(g(a, u)|_{Q^{(1)}}) = 1$ or $\exists y$, such that $y \geq n$, $u_y = 1$, and $g^{(2)}(a, y - n) = f_2$, for all $u \in B^Q$ and $a \in \Sigma$. Thus,

$g(q_{n+f_2}, a, (x_0, \ldots, x_{n+2^m-1})) = \bigvee_{g^{(2)}(a, k-n) = f_2} x_k \vee T(a, x_0, \ldots, x_{n-1})$,

where the Boolean function $T$ is the resulting function obtained by substituting $g^{(1)}(q_i, a, (x_0, \ldots, x_{n-1}))$ for $x_i$ in $h^{(1)}$ for all $i < n$.



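Secondly, if $\Sigma^{(1)} \neq \Sigma^{(2)}$, we can construct two r-AFAs $A_1 = (Q_1, \Sigma, g_1, h_1, F_1)$ and $A_2 = (Q_2, \Sigma, g_2, h_2, F_2)$, where $\Sigma = \Sigma^{(1)} \cup \Sigma^{(2)}$, such that $L(A_i) = L(A^{(i)})$ for $i = 1, 2$. For example, let us assume that $\Sigma \neq \Sigma^{(1)}$. The construction of $A_1$ is as follows:

Construct $Q_1 = Q^{(1)} \cup \{newq\}$, where $newq$ is a new state.

Set $F_1 = F^{(1)}$.

Define $h_1$ as: $h_1(u) = h^{(1)}(u|_{Q^{(1)}}) \wedge \bar{u}_{newq}$ for all $u \in B^{Q_1}$.

For any $q \in Q^{(1)}$, $a \in \Sigma$ and $u \in B^{Q_1}$ define:

$$g_1(q, a, u) = \begin{cases} g^{(1)}(q, a, u|_{Q^{(1)}}) & \text{if } a \in \Sigma^{(1)}, \\ 0 & \text{otherwise.} \end{cases}$$

For any $a \in \Sigma$ and $u \in B^{Q_1}$, we define:

$$g_1(newq, a, u) = \begin{cases} u_{newq} & \text{if } a \in \Sigma^{(1)}, \\ 1 & \text{otherwise.} \end{cases}$$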



The last case is $mn = 0$. We can assume $\Sigma^{(1)} = \Sigma^{(2)}$ in this case (otherwise we can use the above construction to convert to this case).

Assume that $n = 0$ and $m \neq 0$. We omit the other cases here because they use a similar construction.

If $h^{(1)} = 0$, then $L(A^{(1)}) = \emptyset$. So, $L(A^{(1)})L(A^{(2)}) = \emptyset$. Therefore, we can let $A = A^{(1)}$.

If $h^{(1)} = 1$, construct $A_1 = (Q_1, \Sigma, g_1, h_1, F_1)$, such that $|Q_1| = 1$ and $L(A_1) = L(A^{(1)})$ as follows:

$Q_1 = \{q\}$, $h_1 = 1$ and $F_1 = \emptyset$, where $q$ is a new state;
$g_1(q, a, u) = 0$ for all $a \in \Sigma$, $u \in B^{Q_1}$.

Use the algorithm for the case $mn \neq 0$ and $\Sigma^{(1)} = \Sigma^{(2)}$ to construct the r-AFA accepting the concatenation of $L(A_1)$ and $L(A^{(2)})$.

Note that in the construction for the first case, we do not really need all the $2^m$ states that resulted from $A^{(2)}$. In fact, most of the states are of no use in general. We can use a method similar to the one used in the construction for the star operation to reduce the state complexity. Briefly stated, we mark all the $q_i$ for $i \geq n$ such that $i - n$ can be reached from $f_2$ under the transitions. Then we use the marked states to do the construction. We omit the construction here because the reader can easily fill in the details.



3.3 Reversal of r-AFA 

The “reversal” of an r-AFA $A = (Q, \Sigma, g, h, F)$ is an r-AFA $A' = (Q', \Sigma, g', h', F')$ which accepts the reversal of the language of $A$, that is, $L(A') = [L(A)]^R$. Next we give the construction for the reversal operation, which simplifies the reversal r-AFA during the construction by removing all the useless and unreachable states. We assume $n > 0$.

Let $I$ be an array of integers of size $2^n$, $Qu$ be a queue of integers, $T[2^n]$ be an array of integer pointers, $s = |\Sigma|$ and $f$ be the characteristic vector of $F$.
Initially, let
Qu be empty;
I[i] = 0, for all 0 ≤ i < 2^n;
T[i] = 0 for all 0 ≤ i < 2^n;
int index = 0;
int temp, vector;
Qu.push(f);

Find all vectors $v \in B^Q$ that can be reached by $f$, i.e., $\exists x \in \Sigma^*$ such that $v = g(x, f)$. The details are as follows:
while (!Qu.empty())
{
    temp = Qu.pop();
    T[index] = new int[s + 1];
    T[index][0] = temp;
    I[temp] = index;
    for (a ∈ Σ)
    {
        vector = g(a, temp);
        T[index][a] = vector;
        if ((vector != f) && (I[vector] == 0))
            Qu.push(vector);
    }
    index++;
}
for (int i = 0; i < index; i++)
    for (a ∈ Σ)
        T[i][a] = I[T[i][a]];

Construct an inverse table for T.

Let R[index][s] be a two-dimensional array of sets of integers. The function Add(R[i][j], k) adds the integer k into the set R[i][j]. Initially, R[i][j] = ∅ for all 0 ≤ i < index, 0 ≤ j < s. Qu is also initialized to be empty.
for (int i = 0; i < index; i++)
{
    for (a ∈ Σ)
        Add(R[T[i][a]][a], i);
    if (h(T[i][0]) == 1)
    {
        Qu.push(i);
        Add(F'', i);    /* F'' is a set of integers */
    }
}

Mark the new useful states.

Initialize I[k] to 0, for k = 0, ..., index − 1
if (h(f) == 1) I[0] = 1;
while (!Qu.empty())
{
    temp = Qu.pop();
    if (!empty(R, temp))
    {
        I[temp] = 1;
        for (a ∈ Σ)
            for (q ∈ R[temp][a])
                if (I[q] == 0)
                    Qu.push(q);
    }
}

Rename the states.

if (I[0] == 0)
    A' accepts the empty language;
else
{
    int k = 0;
    for (int i = 0; i < index; i++)
        if (I[i] != 0)
        {
            I[i] = k;
            k++;
        }
}

Construct A'.

Construct F'.

Let F' = ∅;
for (i in F'')
    if (I[i] != 0) Add(F', I[i]);

Construct the transition function g'.
for (int i = 0; i < index; i++)
    if (I[i] != 0)
        for (a ∈ Σ)
            g'(I[i], a) = x_{I[T[i][a]]};

Construct the head function h'.
h'(u) = u_0;

4 Improved Implementation of the DFA to r-AFA 
Transformation 

In [17] it is shown that the bitwise representation of r-AFA significantly reduces the space needed to implement regular languages. However, manipulating bit-vectors is time-consuming if they are handled as arrays of 1's and 0's rather than integers. This is exemplified by the improvements made recently to the implementation of the DFA to r-AFA transformation of [17]. The majority of these improvements involve increasing the efficiency of bit-vector handling, and are described below. Overall the run-time of this transformation has been decreased by 97 percent. Thus, the computation of r-AFA for large input DFA is now entirely feasible with respect to time.

4.1 Handling Bit-Vectors: Weight

The computation of an r-AFA for a given $2^n$-state DFA involves the set of bit-vectors $R^n = \{0, 1, 2, \ldots, 2^n - 1\}$. Several parts of the DFA to r-AFA transformation involve information related to the number of 1's contained in these bit-vectors (referred to as their “weight” in [17]):

Step 2a): given an interval of integers, find the integer of lowest weight;

Step 3b): sort arrays Pf and Pn of bit-vectors ascending on weight;

Step 7: simplifying Boolean functions; two terms t1 and t2 differ only in the negation of one variable iff t1 ↑ t2 (↑: Exclusive OR) has weight one.

These three parts of the transformation were in fact the most time-consuming, because computing the weight of elements of $R^n$ was done directly:

for(i=0; i<32; i++)
    if ( (bit_vector & pow(2,i)) == pow(2,i) )
        weight++;

Since the transformation requires such weight-related information, it has proven extremely useful to build an array W which contains the elements of $R^n$ in order of increasing weight. The procedure Build_Weight_Array(int n) described below accomplishes this in time linear in the size of W, i.e., $O(2^n)$. Traversing W is then the same as traversing $R^n$ in order of increasing weight. Thus, steps 3a) and 3b) of the transformation can be implemented as one loop:

for each bit-vector u in W 

if( h(u)==l ) 

add u to array Pf ; 

else 

add u to array Pn; 

After filling Pf and Pn this way, they are already sorted ascending on weight (making implementation of step 3b) unnecessary). This improvement alone reduced the runtime of the filter by 70 percent.

To compare the weights of some subset S of $R^n$, an array W', the inverse of W, is useful, where W'[W[i]] = i for all i in $R^n$. Then for any k in $R^n$, W'[k] is the array index of k in W. So if k is the least element in set W'[S], then W[k] is the element of S with the smallest weight. Using this method to compute step 2a) also greatly reduced the overall runtime of the transformation.

Finally, one of the tests most often applied in the Boolean function simplification procedure being used is whether two terms differ only in the negation of one variable. And since we are representing terms using bit-vectors, two terms t1 and t2 differ only in the negation of one variable iff t1 ↑ t2 has weight one. Previously, all 32 positions of the bit-vector t1 ↑ t2 were scanned for 1's. But if array W' is available, then

(the weight of bit-vector v is 1) iff (1 <= W'[v] <= n)

making only two integer comparisons necessary for the test. This improvement decreased the runtime of the simplification routine by 95 percent.

What follows is the pseudocode describing how array W is constructed. To simplify the description, W is described in terms of blocks, where block i contains the integers of $R^n$ with weight i. Numbers in block i + 1 are generated by either incrementing (giving an “incremented_int”) certain numbers in block i, or shifting (giving a “shifted_int”) certain numbers in block i + 1 as follows:

Build_Weight_Array (n)
{
    W[0] = 0;    // 0 is considered a shifted_int
    for(i = 1 .. n) {
        for each shifted_int s in block i-1,
            (next incremented_int of block i of W) := s + 1;
        for each incremented_int j in block i
            while( (k = leftshift(j)) <= (2^n)-1 )
                (next shifted_int of block i of W) := k;
    }
}

For example, let n=4. Then,

block 0 : W[0]  = 0           = 0  = 0000
block 1 : W[1]  = 0+1         = 1  = 0001
          W[2]  = <<1         = 2  = 0010
          W[3]  = <<(<<1)     = 4  = 0100
          W[4]  = <<(<<(<<1)) = 8  = 1000
block 2 : W[5]  = 2+1         = 3  = 0011
          W[6]  = 4+1         = 5  = 0101
          W[7]  = 8+1         = 9  = 1001
          W[8]  = <<3         = 6  = 0110
          W[9]  = <<(<<3)     = 12 = 1100
          W[10] = <<5         = 10 = 1010
block 3 : W[11] = 6+1         = 7  = 0111
          W[12] = 12+1        = 13 = 1101
          W[13] = 10+1        = 11 = 1011
          W[14] = <<7         = 14 = 1110
block 4 : W[15] = 14+1        = 15 = 1111
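The block construction can be sketched in Python as follows. This is an illustrative reading of the pseudocode, assuming that "leftshift" is applied repeatedly to each incremented integer until the result exceeds $2^n - 1$; the names `build_weight_array`, `invert` and `has_weight_one` are hypothetical.

```python
def build_weight_array(n):
    """Return the integers 0 .. 2^n - 1 ordered by increasing weight."""
    limit = (1 << n) - 1
    W = [0]
    shifted = [0]  # 0 is considered a shifted_int (block 0)
    for _ in range(1, n + 1):
        # Odd members of the block: an even number of the previous block plus 1.
        incremented = [s + 1 for s in shifted]
        # Even members of the block: repeated left shifts of the odd members.
        shifted = []
        for j in incremented:
            k = j << 1
            while k <= limit:
                shifted.append(k)
                k <<= 1
        W.extend(incremented)
        W.extend(shifted)
    return W


def invert(W):
    """Return W' with W'[W[i]] = i."""
    W_inv = [0] * len(W)
    for index, value in enumerate(W):
        W_inv[value] = index
    return W_inv


def has_weight_one(v, W_inv, n):
    """Two integer comparisons replace a scan over all bit positions."""
    return 1 <= W_inv[v] <= n


W = build_weight_array(4)
print(W)  # [0, 1, 2, 4, 8, 3, 5, 9, 6, 12, 10, 7, 13, 11, 14, 15]
print(has_weight_one(0b0101 ^ 0b0001, invert(W), 4))  # True
```

Every integer is appended exactly once, so the construction takes time proportional to the length of W.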
Abstract. Directed Acyclic Subsequence Graph (DASG) is an automaton that accepts all subsequences of a given string T. DASG allows us to decide whether a string S is a subsequence of T in $O(|S|)$ time, where $|S|$ is the length of S. We show that if we slightly modify the string T, it is possible to get the DASG for the modified string from the original DASG. For this purpose we define these operations on DASG: adding a state on the left, deleting a state on the left, adding a state on the right, deleting a state on the right, adding an inner state, deleting an inner state and replacing a transition label. For each of these operations we describe the modification of DASG and the proof of correctness.



1 Introduction 

A subsequence of a string is any string obtained by deleting zero or more symbols 
from the given string. Subsequences play an important role in data processing 
and genetic applications (jS|)- For example, the longest common subsequence 
(PI) and sequence alignment (0) are the best known problems. An important 
question to answer is the membership problem. That is, we are to determine 
whether a given string S is a subsequence of another string T. If we allow the 
string T to be preprocessed, we can answer the question in optimal time (that is, 
in time 0(|S|)). During the preprocessing of the string T we build an automaton 
that accepts all subsequences of this string. Such an automaton is called Directed 
Acyclic Subsequence Graph (DASG) and was introduced in p. A left-to- right 
algorithm for building DASG is described in p) . DASG is analogous to Directed 
Acyclic Word Graph (DAWG)^] using subsequences instead of substrings. 

Let $\Sigma$ be an alphabet and $T = t_1 t_2 \cdots t_n$ a string over this alphabet. The DASG for the string T is a finite automaton $A = (Q, \Sigma, \delta, q_0, F)$, where $Q = \{q_0, q_1, \ldots, q_n\}$ is a set of states, $\Sigma$ is the input alphabet, $\delta : Q \times \Sigma \to Q$ is a transition function, $q_0$ is the initial state, and $F = Q$ is a set of final states (all the states are final). Clearly, the automaton has a minimal number of states as it must accept the string T. The automaton can be partial, that is, each state need not have a transition defined for each symbol. The transition function $\delta$ is defined as follows (for $a \in \Sigma$ and $0 \leq i \leq n$):

* This research has been supported by GACR grant No. 201/98/1155 



J.-M. Champarnaud, D. Maurel, D. Ziadi (Eds.): WIA’98, LNCS 1660, pp. 82-|^3 1999- 
$\delta(q_i, a) = q_j$, if there exists $k > i$ such that $a = t_k$ ($j$ is the minimal such $k$),

$\delta(q_i, a) = \emptyset$, otherwise.

For convenience, we define $\delta^* : Q \times \Sigma^* \to Q$ recursively by:

$\delta^*(q, a) = \delta(q, a)$

$\delta^*(q, xa) = \delta(\delta^*(q, x), a)$
for $q \in Q$, $a \in \Sigma$ and $x \in \Sigma^*$.
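The definition of $\delta$ can be turned into a short Python sketch. It builds the transition table by a right-to-left scan (a choice made for brevity here; the left-to-right algorithm mentioned in the introduction is not reproduced) and stores it as a dictionary of dictionaries, which is an assumed representation. An undefined transition is simply a missing key.

```python
def build_dasg(text):
    """Return (delta, initial) where delta[i][a] = min{k > i : t_k = a}."""
    delta = {}
    nearest = {}  # nearest[a] = smallest position k > i with t_k = a
    for i in range(len(text), -1, -1):
        delta[i] = dict(nearest)
        if i > 0:
            nearest[text[i - 1]] = i  # t_i is text[i - 1]
    return delta, 0


def accepts(delta, initial, s):
    """Every state is final, so s is accepted iff no transition is missing."""
    state = initial
    for symbol in s:
        state = delta[state].get(symbol)
        if state is None:
            return False
    return True


delta, q0 = build_dasg("abcab")
print(accepts(delta, q0, "acb"))  # True
print(accepts(delta, q0, "cc"))   # False
```

The membership test follows one transition per symbol of S, which gives the $O(|S|)$ bound when dictionary look-ups are taken as constant time.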

Let us suppose that we have built the DASG for the string $T = t_1 t_2 \cdots t_n$ and that the string $T'$ can be obtained from the string $T$ using one of the following operations: adding a symbol, deleting a symbol, and replacing a symbol. We show how to modify the DASG for the string $T$ to obtain the DASG for the string $T'$. For this purpose we describe these operations on the DASG:

– adding a state on the left,

– deleting a state on the left,

– adding a state on the right,

– deleting a state on the right,

– adding an inner state,

– deleting an inner state,

– replacing a transition label.

Obviously, adding a state on the left and on the right are special cases of adding an inner state. Similarly, deleting a state on the left and on the right are special cases of deleting an inner state. As the algorithms for these special cases are much simpler than for the general operations, we present them separately. Each operation preserves the minimality of the number of states, which results from the fact that the modified automaton has to accept the modified string $T'$.

For the time complexity, we consider representation of the DASG with ex- 
plicit expression of transitions. That is, if we delete a state of the DASG, we 
have to delete all its transitions and it takes at least time linear in the number 
of transitions of this state. 

2 Adding a State on the Left 

The operation is required if we add a symbol $x$ before the first symbol of the string $T$. Then, $T' = xT = x t_1 t_2 \cdots t_n$. The DASG for the string $T'$ is $A' = (Q', \Sigma, \delta', q, Q')$, where $Q' = Q \cup \{q\}$ and $\delta'$ is defined as follows:

$\delta'(p, a) = \delta(p, a)$ for all $p \in Q$ and all $a \in \Sigma$,

$\delta'(q, x) = q_0$,

$\delta'(q, a) = \delta(q_0, a)$ for all $a \in \Sigma$, $a \neq x$.

Lemma 1 $A'$ accepts a string $S = s_1 s_2 \cdots s_m$ if and only if $S$ is a subsequence of the string $T'$.

Proof. We prove the Lemma in two steps.

1. $A'$ accepts all subsequences of $T'$: For any subsequence $S$ of $T'$ the following holds: $S = R$, or $S = xR$, where $R$ denotes a subsequence of $T$. Let us consider both cases:

(a) $S = R$: If $s_1 \neq x$, then $\delta'(q, s_1) = \delta(q_0, s_1)$ and because $S$ is accepted by $A$, it is accepted by $A'$ as well.

If $s_1 = x$, then $\delta'(q, s_1) = q_0$. If $s_1 s_2 \ldots s_m$ is a subsequence of $T$, then $s_2 \cdots s_m$ is also a subsequence of $T$. Therefore, $s_2 \cdots s_m$ is accepted by $A$ and $S$ is accepted by $A'$.

(b) $S = xR$: $\delta'(q, x) = q_0$. As $R$ is accepted by $A$, $xR$ is accepted by $A'$.

2. If $S$ is accepted by $A'$, then $S$ is a subsequence of $T'$: Again, there are two possibilities:

(a) $s_1 = x$: $\delta'(q, s_1) = q_0$ and $s_2 \cdots s_m$ is accepted by $A$, that is, $s_2 \cdots s_m$ is a subsequence of $T$. Consequently, $S$ is a subsequence of $T'$.

(b) $s_1 \neq x$: $\delta'(q, s_1) = \delta(q_0, s_1)$ and $S$ is accepted by $A$, that is, $S$ is a subsequence of $T$. Hence, $S$ is a subsequence of $T'$ as well.

□



Algorithm:

Input: DASG for the string $T = t_1 t_2 \cdots t_n$
Output: DASG for the string $T' = x t_1 t_2 \cdots t_n$

$\delta(q, x) \leftarrow q_0$
for each $a \in \Sigma$, $a \neq x$ do
    $\delta(q, a) \leftarrow \delta(q_0, a)$
end for
$Q \leftarrow Q \cup \{q\}$

Time complexity: the algorithm requires $O(|\Sigma|)$ time.
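
The construction can be sketched in Python. This is only an illustration under assumptions that are not part of the paper: each state is a dict mapping a symbol to its successor state, the names `build_dasg`, `accepts` and `add_left` are hypothetical, and the states are kept in a `collections.deque` so that the new initial state can be prepended in constant time.

```python
from collections import deque

def build_dasg(text):
    """Build the DASG of `text`; state q_i is a dict {symbol: successor state}."""
    states = [{} for _ in range(len(text) + 1)]
    nearest = {}                      # symbol -> nearest state to the right
    for i in range(len(text), 0, -1):
        nearest[text[i - 1]] = states[i]
        states[i - 1].update(nearest)
    return deque(states)

def accepts(start, s):
    """Follow the transitions of `s` from `start`; every state is final."""
    state = start
    for symbol in s:
        if symbol not in state:
            return False
        state = state[symbol]
    return True

def add_left(states, x):
    """Prepend symbol x: create the new initial state q in O(|alphabet|) time."""
    q0 = states[0]
    q = dict(q0)      # delta'(q, a) = delta(q0, a) for every a != x
    q[x] = q0         # delta'(q, x) = q0
    states.appendleft(q)
    return q

dasg = build_dasg("ab")
start = add_left(dasg, "b")           # the automaton now represents "bab"
print(accepts(start, "bb"), accepts(start, "aa"))   # True False
```

Only the new state is written; the existing states are shared unchanged, which is why the cost does not depend on $n$.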



3 Deleting a State on the Left 

The operation is required if we delete the first symbol of the string $T$. Then
$T' = t_2 t_3 \cdots t_n$. The DASG for the string $T'$ is $A' = (Q', \Sigma, \delta', q_1, Q')$, where
$Q' = Q \setminus \{q_0\}$ and $\delta'$ is defined as follows:

$\delta'(p, a) = \delta(p, a)$ for all $p \in Q'$ and all $a \in \Sigma$.

Lemma 2 $A'$ accepts a string $S = s_1 s_2 \cdots s_m$ if and only if $S$ is a subsequence
of the string $T'$.

Proof. We prove the Lemma in two steps.

1. $A'$ accepts all subsequences of $T'$: If $S$ is a subsequence of $T'$, then $t_1 S =
t_1 s_1 s_2 \cdots s_m$ is a subsequence of $T$. Consequently, $t_1 S$ is accepted by $A$. As
$\delta(q_0, t_1) = q_1$, $S$ is accepted by $A'$.

2. If $S$ is accepted by $A'$, then $S$ is a subsequence of $T'$: If $A'$ accepts $S$, then
$A$ accepts $t_1 S = t_1 s_1 s_2 \cdots s_m$. Hence, $t_1 S$ is a subsequence of $T$ and this
implies that $S$ is a subsequence of $T'$.

□ 



Algorithm:

Input: DASG for the string $T = t_1 t_2 \cdots t_n$
Output: DASG for the string $T' = t_2 \cdots t_n$

for each $a \in \Sigma$ such that $\delta(q_0, a) \neq \emptyset$ do
    delete the transition $\delta(q_0, a)$
end for
$Q \leftarrow Q \setminus \{q_0\}$

Time complexity: the algorithm requires $O(|\Sigma|)$ time.



4 Adding a State on the Right 

The operation is required if we lengthen the string $T$ by a symbol $x$. Then $T' =
Tx = t_1 t_2 \cdots t_n x$. The DASG for the string $T'$ is $A' = (Q', \Sigma, \delta', q_0, Q')$, where
$Q' = Q \cup \{q\}$ and $\delta'$ is defined as follows:

$\delta'(p, a) = \delta(p, a)$ for all $p \in Q$ and all $a \in \Sigma$,

$\delta'(p, x) = q$ for all $p \in Q$ such that $\delta(p, x) = \emptyset$.

Lemma 3 $A'$ accepts a string $S = s_1 s_2 \cdots s_m$ if and only if $S$ is a subsequence
of the string $T'$.

Proof. We prove the Lemma in two steps.

1. $A'$ accepts all subsequences of $T'$: For any subsequence $S$ of $T'$ the following
holds: $S = R$, or $S = Rx$, where $R$ is a subsequence of $T$. Let us consider
both cases:

(a) $S = R$: $A$ accepts $S$ and therefore $A'$ accepts $S$ as well.

(b) $S = Rx$: $A$ accepts $R$, that is, there exists a state $p \in Q$ such that
$\delta^*(q_0, R) = p$. According to the definition the same transitions also exist in
$A'$ and $p$ has a transition for the symbol $x$ in $A'$. Hence, $A'$ accepts $S$.

2. If $S$ is accepted by $A'$, then $S$ is a subsequence of $T'$: There are two
possibilities:

(a) $s_m = x$: $s_1 s_2 \cdots s_{m-1}$ is accepted by $A$, that is, $s_1 s_2 \cdots s_{m-1}$ is a
subsequence of $T$. Consequently, $S$ is a subsequence of $T'$.

(b) $s_m \neq x$: $S$ is accepted by $A$, that is, $S$ is a subsequence of $T$ and hence
$S$ is a subsequence of $T'$ as well.

□ 



Algorithm:

Input: DASG for the string $T = t_1 t_2 \cdots t_n$
Output: DASG for the string $T' = t_1 t_2 \cdots t_n x$

$i \leftarrow n$
while $i \geq 0$ and $\delta(q_i, x) = \emptyset$ do
    $\delta(q_i, x) \leftarrow q$
    $i \leftarrow i - 1$
end while
$Q \leftarrow Q \cup \{q\}$

Time complexity: the algorithm requires $O(n)$ time in the worst case.
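
A minimal Python sketch of this operation follows. It assumes a representation that is not prescribed by the paper: the states are stored in a list in string order, each state being a dict from symbols to successor states; `build_dasg`, `accepts` and `add_right` are hypothetical names. The loop runs down to $q_0$, because the initial state may also lack a transition for $x$.

```python
def build_dasg(text):
    """Build the DASG of `text`; state q_i is a dict {symbol: successor state}."""
    states = [{} for _ in range(len(text) + 1)]
    nearest = {}                      # symbol -> nearest state to the right
    for i in range(len(text), 0, -1):
        nearest[text[i - 1]] = states[i]
        states[i - 1].update(nearest)
    return states

def accepts(start, s):
    """Follow the transitions of `s` from `start`; every state is final."""
    state = start
    for symbol in s:
        if symbol not in state:
            return False
        state = state[symbol]
    return True

def add_right(states, x):
    """Append symbol x: add the new last state q and the missing x-transitions."""
    q = {}
    i = len(states) - 1
    # The states without an x-transition form a suffix q_i, ..., q_n of the list.
    while i >= 0 and x not in states[i]:
        states[i][x] = q
        i -= 1
    states.append(q)

dasg = build_dasg("ab")
add_right(dasg, "a")                  # the automaton now represents "aba"
print(accepts(dasg[0], "aa"), accepts(dasg[0], "bb"))   # True False
```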
5 Deleting a State on the Right

The operation is required if we delete the last symbol of the string $T$. Then
$T' = t_1 t_2 \cdots t_{n-1}$. The DASG for the string $T'$ is $A' = (Q', \Sigma, \delta', q_0, Q')$, where
$Q' = Q \setminus \{q_n\}$ and, writing $x = t_n$, $\delta'$ is defined as follows:

$\delta'(p, a) = \delta(p, a)$ for all $p \in Q'$ and all $a \in \Sigma$, $a \neq x$,

$\delta'(p, x) = \emptyset$ for all $p \in Q'$ such that $\delta(p, x) = q_n$,

$\delta'(p, x) = \delta(p, x)$ for all $p \in Q'$ such that $\delta(p, x) \neq q_n$.

Lemma 4 $A'$ accepts a string $S = s_1 s_2 \cdots s_m$ if and only if $S$ is a subsequence
of the string $T'$.

Proof. We prove the Lemma in two steps.

1. $A'$ accepts all subsequences of $T'$: If $S$ is a subsequence of $T'$, then $S t_n =
s_1 s_2 \cdots s_m t_n$ is a subsequence of $T$ and $S t_n$ is accepted by $A$. Hence, $\delta^*(q_0, S) =
q_k$, $0 \leq k < n$. According to the definition the same transitions exist in $A'$ and
the following holds: $\delta'^*(q_0, S) = q_k$. Hence, $S$ is accepted by $A'$.

2. If $S$ is accepted by $A'$, then $S$ is a subsequence of $T'$: If $S$ is accepted by $A'$,
then $S t_n = s_1 s_2 \cdots s_m t_n$ is accepted by $A$, that is, $S t_n$ is a subsequence of
$T$. Consequently, $S$ is a subsequence of $T'$.

□ 



Algorithm:

Input: DASG for the string $T = t_1 t_2 \cdots t_n$
Output: DASG for the string $T' = t_1 t_2 \cdots t_{n-1}$

$i \leftarrow n - 1$
while $i \geq 0$ and $\delta(q_i, t_n) = q_n$ do
    $\delta(q_i, t_n) \leftarrow \emptyset$
    $i \leftarrow i - 1$
end while
$Q \leftarrow Q \setminus \{q_n\}$

Time complexity: the algorithm requires $O(n)$ time in the worst case.
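
The deletion can be sketched in Python as follows. The representation (a list of dict states in string order) and the names `build_dasg`, `accepts` and `delete_right` are assumptions made for this illustration only. States are compared by identity (`is`), since two distinct states may have equal transition dicts.

```python
def build_dasg(text):
    """Build the DASG of `text`; state q_i is a dict {symbol: successor state}."""
    states = [{} for _ in range(len(text) + 1)]
    nearest = {}                      # symbol -> nearest state to the right
    for i in range(len(text), 0, -1):
        nearest[text[i - 1]] = states[i]
        states[i - 1].update(nearest)
    return states

def accepts(start, s):
    """Follow the transitions of `s` from `start`; every state is final."""
    state = start
    for symbol in s:
        if symbol not in state:
            return False
        state = state[symbol]
    return True

def delete_right(states, text):
    """Remove the last symbol t_n of the non-empty `text` from its DASG."""
    last_symbol, q_n = text[-1], states[-1]
    i = len(states) - 2
    # Only t_n-transitions enter q_n; their sources form a suffix of the list.
    while i >= 0 and states[i].get(last_symbol) is q_n:
        del states[i][last_symbol]
        i -= 1
    states.pop()

dasg = build_dasg("aba")
delete_right(dasg, "aba")             # the automaton now represents "ab"
print(accepts(dasg[0], "ab"), accepts(dasg[0], "aa"))   # True False
```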



6 Adding an Inner State 

The operation is required if we insert a symbol $x$ into the string $T$. Then,
the modified string is $T' = t_1 \cdots t_i x t_{i+1} \cdots t_n$. The DASG for the string $T'$ is
$A' = (Q', \Sigma, \delta', q_0, Q')$, where $Q' = Q \cup \{q\}$ and $\delta'$ is defined as follows:

$\delta'(p, a) = \delta(p, a)$ for all $p \in Q$ and all $a \in \Sigma$, $a \neq x$,

$\delta'(q, a) = \delta(q_i, a)$ for all $a \in \Sigma$,

$\delta'(p, x) = \delta(p, x)$ for all $p \in \{q_{i+1}, \ldots, q_n\}$,

for all $p \in \{q_0, \ldots, q_i\}$: $\delta'(p, x) = q$ if $\delta(p, x) = \delta(q_i, x)$, $\delta'(p, x) = \delta(p, x)$ otherwise.

Lemma 5 $A'$ accepts a string $S = s_1 s_2 \cdots s_m$ if and only if $S$ is a subsequence
of the string $T'$.

Proof. Let $T_1 = t_1 \cdots t_i$ and $T_2 = t_{i+1} \cdots t_n$. We prove the Lemma in two steps.

1. $A'$ accepts all subsequences of $T'$: Let us consider the maximal $j$ such that
$s_1 s_2 \cdots s_j$ is a subsequence of $T_1$ and denote $S_1 = s_1 s_2 \cdots s_j$. Then, $s_{j+1} \neq x$,
or $s_{j+1} = x$.

(a) $s_{j+1} \neq x$: As $S_1 T_2$ is a subsequence of $T$, it is accepted by $A$. Therefore
the following holds: $\delta^*(q_0, S_1) = q_k$, $0 \leq k \leq i$. As $s_{j+1} \neq x$, $\delta(q_k, s_{j+1}) =
q_l$, $(i + 1) \leq l \leq n$ and $\delta^*(q_l, s_{j+2} \cdots s_m) = q_p$, $l \leq p \leq n$. According to the
definition the same transitions exist in $A'$. Hence, $A'$ accepts $S$.

(b) $s_{j+1} = x$: $\delta^*(q_0, S_1) = q_k$, $0 \leq k \leq i$ and according to the definition the
same transitions exist in $A'$: $\delta'^*(q_0, S_1) = q_k$. Then, $\delta'(q_k, x) = q$ and
according to the definition $q$ in $A'$ has the same transitions as $q_i$ in $A$:
$\delta'(q, s_{j+2}) = \delta(q_i, s_{j+2})$. As $s_{j+2} \cdots s_m$ is a subsequence of $T_2$, $A$ accepts
$t_1 \cdots t_i s_{j+2} \cdots s_m$. Hence the following holds: $\delta^*(q_i, s_{j+2} \cdots s_m) =
q_l$, $i \leq l \leq n$ and according to the definition also $\delta'^*(q, s_{j+2} \cdots s_m) = q_l$.
Consequently, $A'$ accepts $S$.

2. If $S$ is accepted by $A'$, then $S$ is a subsequence of $T'$: Let us consider the
maximal $j$ such that $\delta'^*(q_0, s_1 \cdots s_j) = q_k$, $0 \leq k \leq i$ and denote $S_1 = s_1 s_2 \cdots s_j$.
Then, $s_{j+1} \neq x$, or $s_{j+1} = x$.

(a) $s_{j+1} \neq x$: $\delta'(q_k, s_{j+1}) = \delta(q_k, s_{j+1}) = q_l$, where $(i + 1) \leq l \leq n$, and also
$\delta'^*(q_l, s_{j+2} \cdots s_m) = \delta^*(q_l, s_{j+2} \cdots s_m) = q_p$, $l \leq p \leq n$. Hence, $A$ accepts
$S$ and $S$ is a subsequence of $T$. Any subsequence of $T$ is a subsequence
of $T'$ as well.

(b) $s_{j+1} = x$: $\delta'(q_k, x) = q$ and according to the definition $q$ in $A'$ has the
same transitions as $q_i$ in $A$. Consequently, the following also holds:
$\delta'^*(q, s_{j+2} \cdots s_m) = \delta^*(q_i, s_{j+2} \cdots s_m) = q_p$, where $i \leq p \leq n$. Hence,
$S_1 t_{i+1} \cdots t_n$ and $t_1 \cdots t_i s_{j+2} \cdots s_m$ are accepted by $A$. Then, $S_1$ is a
subsequence of $T_1$ and $s_{j+2} \cdots s_m$ is a subsequence of $T_2$. Consequently
$S = S_1 x s_{j+2} \cdots s_m$ is a subsequence of $T' = T_1 x T_2$.

□ 

Algorithm:

Input: DASG for the string $T = t_1 t_2 \cdots t_n$
Output: DASG for the string $T' = t_1 \cdots t_i x t_{i+1} \cdots t_n$

for each $a \in \Sigma$ do
    $\delta(q, a) \leftarrow \delta(q_i, a)$
end for
$old \leftarrow \delta(q_i, x)$
$j \leftarrow i$
while $j \geq 0$ and $\delta(q_j, x) = old$ do
    $\delta(q_j, x) \leftarrow q$
    $j \leftarrow j - 1$
end while
$Q \leftarrow Q \cup \{q\}$

Time complexity: the algorithm requires $O(n)$ time in the worst case.
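
The insertion can be sketched in Python as follows, again under assumptions of this illustration only: states live in a list in string order, each state is a dict from symbols to successor states, `None` stands for an undefined transition, and `build_dasg`, `accepts` and `add_inner` are hypothetical names. Note that `list.insert` itself costs $O(n)$, which stays within the stated bound.

```python
def build_dasg(text):
    """Build the DASG of `text`; state q_i is a dict {symbol: successor state}."""
    states = [{} for _ in range(len(text) + 1)]
    nearest = {}                      # symbol -> nearest state to the right
    for i in range(len(text), 0, -1):
        nearest[text[i - 1]] = states[i]
        states[i - 1].update(nearest)
    return states

def accepts(start, s):
    """Follow the transitions of `s` from `start`; every state is final."""
    state = start
    for symbol in s:
        if symbol not in state:
            return False
        state = state[symbol]
    return True

def add_inner(states, i, x):
    """Insert x after t_i; the new state q is placed between q_i and q_{i+1}."""
    q_i = states[i]
    q = dict(q_i)                 # delta'(q, a) = delta(q_i, a) for all a
    old = q_i.get(x)              # None stands for an undefined transition
    j = i
    # Redirect every x-transition that used to skip over the insertion point.
    while j >= 0 and states[j].get(x) is old:
        states[j][x] = q
        j -= 1
    states.insert(i + 1, q)

dasg = build_dasg("ab")
add_inner(dasg, 1, "b")               # "ab" becomes "abb"
print(accepts(dasg[0], "abb"), accepts(dasg[0], "ba"))   # True False
```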
7 Deleting an Inner State

The operation is required if we delete an inner symbol $t_i$ from the string $T$. Then,
$T' = t_1 \cdots t_{i-1} t_{i+1} \cdots t_n$. The DASG for the string $T'$ is $A' = (Q', \Sigma, \delta', q_0, Q')$,
where $Q' = Q \setminus \{q_i\}$ and $\delta'$ is defined as follows:

$\delta'(p, a) = \delta(p, a)$ for all $p \in Q'$ and all $a \in \Sigma$, $a \neq t_i$,

$\delta'(p, t_i) = \delta(p, t_i)$ for all $p \in \{q_{i+1}, \ldots, q_n\}$,

for all $p \in \{q_0, \ldots, q_{i-1}\}$: $\delta'(p, t_i) = \delta(q_i, t_i)$ if $\delta(p, t_i) = q_i$, $\delta'(p, t_i) = \delta(p, t_i)$
otherwise.

Lemma 6 $A'$ accepts a string $S = s_1 s_2 \cdots s_m$ if and only if $S$ is a subsequence
of the string $T'$.



Proof. Let $T_1 = t_1 \cdots t_{i-1}$ and $T_2 = t_{i+1} \cdots t_n$. We prove the Lemma in two
steps.

1. $A'$ accepts all subsequences of $T'$: Let us consider the maximal $j$ such that
$s_1 s_2 \cdots s_j$ is a subsequence of $T_1$ and denote $S_1 = s_1 s_2 \cdots s_j$. Then, $s_{j+1} \neq t_i$,
or $s_{j+1} = t_i$.

(a) $s_{j+1} \neq t_i$: As $S_1 T_2$ is a subsequence of $T$, it is accepted by $A$. Therefore
the following holds: $\delta^*(q_0, S_1) = q_k$, $0 \leq k < i$. As $s_{j+1} \neq t_i$, $\delta(q_k, s_{j+1}) =
q_l$, $(i + 1) \leq l \leq n$ and $\delta^*(q_l, s_{j+2} \cdots s_m) = q_r$, $l \leq r \leq n$. According to the
definition the same transitions exist in $A'$. Hence, $A'$ accepts $S$.

(b) $s_{j+1} = t_i$: $\delta^*(q_0, S_1) = q_k$, $0 \leq k < i$ and according to the definition the
same transitions exist in $A'$: $\delta'^*(q_0, S_1) = q_k$. Also the following holds:
$\delta(q_k, t_i) = q_i$ and according to the definition $\delta'(q_k, t_i) = \delta(q_i, t_i) = q_l$, $k <
l \leq n$. As $S_1 t_i s_{j+1} \cdots s_m$ is a subsequence of $T$, it is accepted by $A$
and therefore the following holds: $\delta^*(q_l, s_{j+2} \cdots s_m) = q_r$, $l \leq r \leq n$ and
according to the definition also $\delta'^*(q_l, s_{j+2} \cdots s_m) = q_r$. Hence, $A'$ accepts
$S = s_1 \cdots s_m$.

2. If $S$ is accepted by $A'$, then $S$ is a subsequence of $T'$: Let us consider the maximal
$j$ such that $\delta'^*(q_0, s_1 \cdots s_j) = q_k$, $0 \leq k \leq (i - 1)$ and denote $S_1 = s_1 \cdots s_j$.
According to the definition: $\delta^*(q_0, S_1) = q_k$. Then, $s_{j+1} \neq t_i$, or $s_{j+1} = t_i$.

(a) $s_{j+1} \neq t_i$: $\delta'(q_k, s_{j+1}) = \delta(q_k, s_{j+1}) = q_l$, $(i + 1) \leq l \leq n$. As $A$ accepts
$S_1 t_i \cdots t_n$ and $t_1 \cdots t_i s_{j+1} \cdots s_m$, $S_1$ is a subsequence of $T_1$ and
$s_{j+1} \cdots s_m$ is a subsequence of $T_2$. Hence, $S_1 s_{j+1} \cdots s_m$ is a subsequence
of $T'$.

(b) $s_{j+1} = t_i$: $\delta'(q_k, t_i) = \delta(q_i, t_i) = q_l$, $(i + 1) \leq l \leq n$. Again, as $A$ accepts
$S_1 t_i \cdots t_n$ and $t_1 \cdots t_i s_{j+1} \cdots s_m$, $S_1$ is a subsequence of $T_1$ and
$s_{j+1} \cdots s_m$ is a subsequence of $T_2$. Hence, $S_1 s_{j+1} \cdots s_m$ is a subsequence
of $T'$.

□ 



Algorithm:

Input: DASG for the string $T = t_1 t_2 \cdots t_n$
Output: DASG for the string $T' = t_1 \cdots t_{i-1} t_{i+1} \cdots t_n$

$new \leftarrow \delta(q_i, t_i)$
$j \leftarrow i - 1$
while $j \geq 0$ and $\delta(q_j, t_i) = q_i$ do
    $\delta(q_j, t_i) \leftarrow new$
    $j \leftarrow j - 1$
end while
for each $a \in \Sigma$ such that $\delta(q_i, a) \neq \emptyset$ do
    delete the transition $\delta(q_i, a)$
end for
$Q \leftarrow Q \setminus \{q_i\}$

Time complexity: the algorithm requires $O(n + |\Sigma|)$ time in the worst case.
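
A Python sketch of the deletion is given below. The list-of-dicts representation and the names `build_dasg`, `accepts` and `delete_inner` are assumptions of this illustration. When $\delta(q_i, t_i)$ is undefined, the redirected transitions are removed instead of being set to an empty target.

```python
def build_dasg(text):
    """Build the DASG of `text`; state q_i is a dict {symbol: successor state}."""
    states = [{} for _ in range(len(text) + 1)]
    nearest = {}                      # symbol -> nearest state to the right
    for i in range(len(text), 0, -1):
        nearest[text[i - 1]] = states[i]
        states[i - 1].update(nearest)
    return states

def accepts(start, s):
    """Follow the transitions of `s` from `start`; every state is final."""
    state = start
    for symbol in s:
        if symbol not in state:
            return False
        state = state[symbol]
    return True

def delete_inner(states, text, i):
    """Delete symbol t_i (1 <= i <= n) of `text` from its DASG `states`."""
    t_i, q_i = text[i - 1], states[i]
    new = q_i.get(t_i)            # None if t_i does not occur after position i
    j = i - 1
    while j >= 0 and states[j].get(t_i) is q_i:
        if new is None:
            del states[j][t_i]
        else:
            states[j][t_i] = new
        j -= 1
    q_i.clear()                   # delete the transitions leaving q_i
    del states[i]

dasg = build_dasg("abab")
delete_inner(dasg, "abab", 2)         # "abab" becomes "aab"
print(accepts(dasg[0], "aab"), accepts(dasg[0], "ba"))   # True False
```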



8 Replacing a Transition Label 

The operation is required if we replace a symbol $t_i$ of the string $T$ by the symbol
$x$. Then, $T' = t_1 \cdots t_{i-1} x t_{i+1} \cdots t_n$. If $x \notin \Sigma$, then we define $\delta(q, x) = \emptyset$ for all
$q \in Q$. The DASG for the string $T'$ is $A' = (Q, \Sigma \cup \{x\}, \delta', q_0, Q)$, where $\delta'$ is
defined as follows:

$\delta'(p, a) = \delta(p, a)$ for all $p \in Q$ and all $a \in \Sigma$, $a \neq x$, $a \neq t_i$,

$\delta'(p, x) = \delta(p, x)$ for all $p \in \{q_i, \ldots, q_n\}$,

$\delta'(p, t_i) = \delta(q_i, t_i)$ for all $p \in Q$ such that $\delta(p, t_i) = q_i$,

$\delta'(p, t_i) = \delta(p, t_i)$ for all $p \in Q$ such that $\delta(p, t_i) \neq q_i$,

for all $p \in \{q_0, \ldots, q_{i-1}\}$: $\delta'(p, x) = q_i$ if $\delta(p, x) = \delta(q_i, x)$, $\delta'(p, x) = \delta(p, x)$
otherwise.

Lemma 7 $A'$ accepts a string $S = s_1 s_2 \cdots s_m$ if and only if $S$ is a subsequence
of the string $T'$.



Proof. Let $T_1 = t_1 \cdots t_{i-1}$ and $T_2 = t_{i+1} \cdots t_n$. We prove the Lemma in two
steps.

1. $A'$ accepts all subsequences of $T'$: Let us consider the maximal $j$ such that
$s_1 s_2 \cdots s_j$ is a subsequence of $T_1$ and denote $S_1 = s_1 s_2 \cdots s_j$. Then, $s_{j+1} \neq x$,
or $s_{j+1} = x$.

(a) $s_{j+1} \neq x$: As $S_1 t_i \cdots t_n$ is a subsequence of $T$, it is accepted by $A$ and
the following holds: $\delta^*(q_0, S_1) = q_k$, $0 \leq k \leq (i - 1)$. According to the
definition the same transitions exist in $A'$: $\delta'^*(q_0, S_1) = q_k$. Again, according
to the definition: $\delta'(q_k, s_{j+1}) = \delta(q_k, s_{j+1}) = q_l$, $(i + 1) \leq l \leq n$ and
$\delta^*(q_l, s_{j+2} \cdots s_m) = \delta'^*(q_l, s_{j+2} \cdots s_m) = q_r$, $l \leq r \leq n$. Consequently,
$A'$ accepts $S$.

(b) $s_{j+1} = x$: As $S_1 t_i \cdots t_n$ is a subsequence of $T$, the following holds:
$\delta^*(q_0, S_1) = q_k$, $0 \leq k \leq (i - 1)$ and according to the definition the
following holds: $\delta'(q_k, x) = q_i$. As $s_{j+2} \cdots s_m$ is a subsequence of $T_2$, $A$
accepts $t_1 \cdots t_i s_{j+2} \cdots s_m$: $\delta^*(q_i, s_{j+2} \cdots s_m) = q_r$, $i \leq r \leq n$. According
to the definition $\delta'^*(q_i, s_{j+2} \cdots s_m) = q_r$. Consequently, $A'$ accepts $S$.

2. If $S$ is accepted by $A'$, then $S$ is a subsequence of $T'$: Let us consider the maximal
$j$ such that $\delta'^*(q_0, s_1 \cdots s_j) = q_k$, $0 \leq k \leq (i - 1)$ and denote $S_1 = s_1 \cdots s_j$.
According to the definition: $\delta^*(q_0, S_1) = q_k$. Then, $s_{j+1} \neq x$, or $s_{j+1} = x$.

(a) $s_{j+1} \neq x$: $\delta'(q_k, s_{j+1}) = q_l$, $(i + 1) \leq l \leq n$ and according to the
definition $\delta(q_k, s_{j+1}) = q_l$. Consequently, $S_1 t_i \cdots t_n$ and $t_1 \cdots t_i s_{j+1} \cdots s_m$
are both accepted by $A$. This implies that $S_1$ is a subsequence of $T_1$
and $s_{j+1} \cdots s_m$ is a subsequence of $T_2$. Hence, $S = S_1 s_{j+1} \cdots s_m$ is a
subsequence of $T'$.

(b) $s_{j+1} = x$: According to the definition: $\delta'(q_k, x) = q_i$ and $\delta'^*(q_i, s_{j+2} \cdots s_m) =
\delta^*(q_i, s_{j+2} \cdots s_m)$. Consequently, $S_1 t_i \cdots t_n$ and $t_1 \cdots t_i s_{j+2} \cdots s_m$ are
both accepted by $A$. This implies that $S_1$ is a subsequence of $T_1$ and
$s_{j+2} \cdots s_m$ is a subsequence of $T_2$. Hence, $S = S_1 x s_{j+2} \cdots s_m$ is a
subsequence of $T'$.

□ 



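Algorithm:

Input: DASG for the string $T = t_1 t_2 \cdots t_n$
Output: DASG for the string $T' = t_1 \cdots t_{i-1} x t_{i+1} \cdots t_n$

$new \leftarrow \delta(q_i, t_i)$
$j \leftarrow i - 1$
while $j \geq 0$ and $\delta(q_j, t_i) = q_i$ do
    $\delta(q_j, t_i) \leftarrow new$
    $j \leftarrow j - 1$
end while
$old \leftarrow \delta(q_i, x)$
$j \leftarrow i - 1$
while $j \geq 0$ and $\delta(q_j, x) = old$ do
    $\delta(q_j, x) \leftarrow q_i$
    $j \leftarrow j - 1$
end while

Time complexity: the algorithm requires $O(n)$ time in the worst case.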
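
The two loops can be sketched in Python as follows. As before, the list-of-dicts representation and the names `build_dasg`, `accepts` and `replace_label` are assumptions of this illustration, and states are compared by identity. No state is created or removed; only transitions entering $q_i$ change.

```python
def build_dasg(text):
    """Build the DASG of `text`; state q_i is a dict {symbol: successor state}."""
    states = [{} for _ in range(len(text) + 1)]
    nearest = {}                      # symbol -> nearest state to the right
    for i in range(len(text), 0, -1):
        nearest[text[i - 1]] = states[i]
        states[i - 1].update(nearest)
    return states

def accepts(start, s):
    """Follow the transitions of `s` from `start`; every state is final."""
    state = start
    for symbol in s:
        if symbol not in state:
            return False
        state = state[symbol]
    return True

def replace_label(states, text, i, x):
    """Replace symbol t_i (1 <= i <= n) of `text` by x in its DASG `states`."""
    t_i, q_i = text[i - 1], states[i]
    new = q_i.get(t_i)            # None if t_i does not occur after position i
    j = i - 1
    while j >= 0 and states[j].get(t_i) is q_i:   # old t_i-transitions skip q_i
        if new is None:
            del states[j][t_i]
        else:
            states[j][t_i] = new
        j -= 1
    old = q_i.get(x)
    j = i - 1
    while j >= 0 and states[j].get(x) is old:     # q_i is now the nearest x
        states[j][x] = q_i
        j -= 1

dasg = build_dasg("abc")
replace_label(dasg, "abc", 2, "c")    # "abc" becomes "acc"
print(accepts(dasg[0], "acc"), accepts(dasg[0], "ab"))   # True False
```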

9 Conclusion 

We have defined the following operations on DASG: adding a state on the left, 
deleting a state on the left, adding a state on the right, deleting a state on the 
right, adding an inner state, deleting an inner state, and replacing a transition 
label. For each of these operations, we have provided the modification of DASG 
and the proof of its correctness. 

Note: Dominique Revuz suggested a representation of the DASG in which each
state has, for each letter of the alphabet, a list of the states that have a
transition to this state labeled by that letter. With this representation,
some operations on the DASG are much simpler.
Abstract. There are two ways of using nondeterministic finite automata
(NFA). The first one is the transformation to the equivalent
deterministic finite automaton and the second one is the simulation of
the run of the NFA. In this paper we discuss the second way. We present
an overview of the simulation methods that have been found in
approximate string matching. We generalize these simulation methods and
formulate rules for their usage.



1 Introduction 

The nondeterministic finite automaton (NFA) is a quintuple $M = (Q, \Sigma, \delta, I, F)$,
where $Q$ is a finite set of states, $\Sigma$ is a finite input alphabet, $\delta$ is a mapping from
$Q \times (\Sigma \cup \{\varepsilon\})$ to the set $\mathcal{P}(Q)$ of all the subsets of $Q$, $I \subseteq Q$ is a set of the initial
states and $F \subseteq Q$ is a set of the final states.

An NFA cannot be used directly because of its nondeterminism: there are some
states $q \in Q$ from which the NFA can move on an input symbol $a \in \Sigma$ to more than
one state, because $|\delta(q, a)| > 1$ is possible. There are two possibilities of using
an NFA:

1. to transform the NFA to the equivalent deterministic finite automaton (DFA),

2. to simulate the run of the NFA in a deterministic way.

A DFA has only one initial state, $\delta(q, \varepsilon) = \{q\}$, $\forall q \in Q$, and $|\delta(q, a)| \leq 1$,
$\forall q \in Q$, $\forall a \in \Sigma$. If the DFA has to perform a transition $\delta(q, a)$, $q \in Q$, $a \in \Sigma$,
such that $\delta(q, a) = \emptyset$, the computation ends without reaching any final state.
Some definitions [HU79] require $|\delta(q, a)| = 1$, $\forall q \in Q$, $\forall a \in \Sigma$. To fulfil
this condition we can insert a new state $q'$ and modify $\delta$ such that $\delta(q', a) = \{q'\}$,
$\forall a \in \Sigma$, and replace each transition $\delta(q, a) = \emptyset$, $q \in Q$, $a \in \Sigma$, by the transition
$\delta(q, a) = \{q'\}$. But in this case the DFA has to process the whole input string to
get the result of the computation even if it is known before the end of the input
string (the DFA is in a state from which no final state can be reached).

* This research was partially supported by grant 201/98/1155 of the Grant Agency of 
Czech Republic and by internal grant 3098098/336 of Czech Technical University. 
The transformation of the NFA to the equivalent DFA using the standard subset
construction eliminating inaccessible states may lead to a DFA with $2^m$ states and
take time $O(2^m)$, where $m$ is the number of states of the NFA. The transformation
is shown in Figure 1, where $\varepsilon CLOSURE(P)$ is the set $\{q' \mid q' \in \delta(q, \varepsilon),\ q \in P\} \cup P$.

After the transformation the DFA runs in time $O(n)$, where $n$ is the length of the
input string.



Algorithm

Input: NFA $M = (Q, \Sigma, \delta, I, F)$.

Output: DFA $M'$ accepting language $L(M)$.

Method: $M' = (Q', \Sigma, \delta', q'_0, F')$ where $Q'$, $\delta'$, $q'_0$, and $F'$ are constructed in the
following way:

$q'_0 := \varepsilon CLOSURE(I)$
$Q' := \emptyset$
$F' := \emptyset$
$S := \{q'_0\}$ /* the set of not yet processed states */
for each $q' \in S$ do
    for each $a \in \Sigma$ do
        $\delta'(q', a) := \varepsilon CLOSURE(\bigcup_{q \in q'} \delta(q, a))$
        if $\delta'(q', a) \notin Q'$ and $\delta'(q', a) \notin S$ then
            $S := S \cup \{\delta'(q', a)\}$
        endif
    endfor
    if $q' \cap F \neq \emptyset$ then
        $F' := F' \cup \{q'\}$
    endif
    $S := S - \{q'\}$
    $Q' := Q' \cup \{q'\}$
endfor

Fig. 1. The transformation of NFA to the equivalent DFA without inaccessible
states.
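
The subset construction can be sketched in Python as follows. The encoding is an assumption of this example, not the paper's: `delta` is a dict from `(state, symbol)` to a set of states, the empty string `""` stands for $\varepsilon$, and the names `eps_closure` and `nfa_to_dfa` are hypothetical. The closure here follows chains of $\varepsilon$-transitions, and empty target sets are simply left undefined.

```python
def eps_closure(delta, states):
    """All states reachable from `states` by epsilon-transitions (key symbol "")."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, ""), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)

def nfa_to_dfa(alphabet, delta, initial, final):
    """Subset construction that only creates accessible DFA states."""
    start = eps_closure(delta, initial)
    dfa_delta, dfa_final = {}, set()
    todo, seen = [start], {start}     # `todo` holds the not yet processed states
    while todo:
        subset = todo.pop()
        if subset & final:
            dfa_final.add(subset)
        for a in alphabet:
            moved = {r for q in subset for r in delta.get((q, a), ())}
            target = eps_closure(delta, moved)
            if not target:
                continue              # undefined transition
            dfa_delta[subset, a] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return seen, dfa_delta, start, dfa_final

# Example NFA accepting the strings over {a, b} that end with "ab".
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa_states, dfa_delta, start, dfa_final = nfa_to_dfa("ab", delta, {0}, {2})
print(len(dfa_states))                # 3
```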



On the other hand, the simulation of the run of the NFA has a higher running
time, but we can save some space and the preprocessing time. If the
space complexity of the DFA makes the DFA unusable, we have to use one of the
simulation methods.



2 Naive Method 

In the simulation of the run of the NFA a set of active states is maintained. At the
beginning of the simulation the set of active states contains all the initial states.
Then in each step of the simulation a new set of active states is computed by
evaluating the transitions for all active states of the previous step. The
naive method is very similar to the transformation of the NFA to the equivalent
DFA using the standard subset construction eliminating inaccessible states and
is shown in Figure 2.



Algorithm

Input: NFA $M = (Q, \Sigma, \delta, I, F)$, input text $T = t_1 t_2 \cdots t_n$.
Output: Output of run of NFA.

Method: Set $S$ of active states is used.

$S := I$
$i := 1$
while $i \leq n$ and $S \neq \emptyset$ do
    $S := \bigcup_{q \in S} \varepsilon CLOSURE(\delta(q, t_i))$
    if $S \cap F \neq \emptyset$ then
        write(information associated with each final state in $S$)
    endif
    $i := i + 1$
endwhile

Fig. 2. The simulation of the run of NFA: the naive method.



This naive simulation runs in time $O(|Q| n)$ and space $O(|Q| |\Sigma|)$, where $|Q|$
is the number of states of the NFA and $n$ is the length of the input string. All other
simulation methods are based on this naive method and they differ only in the
implementation of the set of active states.
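
A minimal Python sketch of the naive method is shown below. The encoding is assumed for illustration: `delta` maps `(state, symbol)` to a set of states, `""` stands for $\varepsilon$, and `eps_closure` and `simulate_nfa` are hypothetical names. Unlike the figure, the sketch also closes the initial set under $\varepsilon$-transitions and reports the reached final states as a list instead of writing them out.

```python
def eps_closure(delta, states):
    """All states reachable from `states` by epsilon-transitions (key symbol "")."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, ""), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)

def simulate_nfa(delta, initial, final, text):
    """Return (position, reached final states) for every step that hits a final state."""
    active = eps_closure(delta, initial)
    hits = []
    for i, symbol in enumerate(text, start=1):
        if not active:
            break                     # no active state left, stop early
        moved = {r for q in active for r in delta.get((q, symbol), ())}
        active = eps_closure(delta, moved)
        reached = active & final
        if reached:
            hits.append((i, reached))
    return hits

# Example NFA accepting the strings over {a, b} that end with "ab".
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
print([i for i, _ in simulate_nfa(delta, {0}, {2}, "aabab")])   # [3, 5]
```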

We can directly implement the naive method such that the set of active states
is implemented by a bit vector (0 represents that the corresponding
state is active and 1 represents that the state is not active; the meaning of 0 and
1 can also be exchanged). This method is shown in Figure 3. Operators AND and
OR are the bitwise operations AND and OR, respectively. In this case an NFA
without $\varepsilon$-transitions is required. This condition does not restrict the simulation, since
each NFA with $\varepsilon$-transitions can be transformed to an equivalent NFA without
$\varepsilon$-transitions. In Figure 3 we can replace the test of whether a final state has been
reached by a test of which final state has been reached. In such a case we can
report the information associated with that final state.
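
The bit-vector variant can be sketched in Python with unbounded integers used as bit vectors. This is an illustration under assumptions of the example: bit $j$ of an integer stands for state $j$ (numbered from 0), a bit value 0 means "active" as in the text, and `make_vector` and `simulate_bitvector` are hypothetical names.

```python
def make_vector(n, states):
    """Bit vector of width n with 0 at the given states and 1 elsewhere."""
    v = (1 << n) - 1
    for q in states:
        v &= ~(1 << q)
    return v

def simulate_bitvector(n, table, initial, final, text):
    """Return the text positions at which some final state is active.

    table[j][a] is the bit vector of delta(q_j, a); 0 marks a target state.
    """
    full = (1 << n) - 1               # (1, 1, ..., 1): no state is active
    s = initial
    hits = []
    for i, symbol in enumerate(text, start=1):
        if s == full:
            break
        new = full
        for j in range(n):
            if not (s >> j) & 1:      # state j is active
                new &= table[j].get(symbol, full)
        s = new
        if (s | final) != full:       # some final state has its bit set to 0
            hits.append(i)
    return hits

# Example NFA accepting the strings over {a, b} that end with "ab".
table = [
    {"a": make_vector(3, {0, 1}), "b": make_vector(3, {0})},
    {"b": make_vector(3, {2})},
    {},
]
print(simulate_bitvector(3, table, make_vector(3, {0}), make_vector(3, {2}), "aabab"))
# [3, 5]
```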

We have identified two further simulation methods, called dynamic programming
and bit parallelism.

3 Dynamic Programming 

In the dynamic programming we divide the set $Q$ of states of the NFA into subsets
$Q_1, Q_2, \ldots, Q_l$ such that:

(1) $Q = \bigcup_{i=1}^{l} Q_i$,

(2) $Q_i \cap Q_j = \emptyset$, $\forall i, j$, $1 \leq i \leq l$, $1 \leq j \leq l$, $i \neq j$,

(3) there can be at most one active state in each subset $Q_i$, $1 \leq i \leq l$.
Algorithm

Input: Number $|Q|$ of states of NFA, transition table $C$ of size $|Q| \times |\Sigma|$ represented
by the bit vectors, bit vector $I$ of the initial states, bit vector $F$ of the final states, and
input text $T = t_1 t_2 \cdots t_n$.

Output: Output of the run of NFA.

Method: Vector $S_i$ of active states is used.

$S_0 := I$
$i := 1$
while $i \leq n$ and $S_{i-1} \neq (1, 1, \ldots, 1)$ do
    $S_i := (1, 1, \ldots, 1)$
    for $j := 1, 2, \ldots, |Q|$ do
        if $S_{i-1,j} = 0$ then $S_i := S_i$ AND $C_{j,t_i}$
    endfor
    if ($S_i$ OR $F$) $\neq (1, 1, \ldots, 1)$ then
        write('NFA has reached a final state')
    endif
    $i := i + 1$
endwhile

Fig. 3. The simulation of the run of NFA: the naive method implemented by
the bit vectors.



The states in each subset are numbered. Each subset $Q_i$ is implemented by
an integer variable that contains the number of the active state in this subset, or a
special value if there is no active state in this subset.

In the dynamic programming our goal is to minimize the number of integer
variables (the number of subsets) and to implement the transitions in the simplest
way.

This method has been used in approximate string matching using the
Levenshtein distance, which is defined as searching for all the occurrences of a
pattern $P = p_1 p_2 \cdots p_m$ in a text $T = t_1 t_2 \cdots t_n$ such that the pattern can be
converted to the found string using at most $k$ edit operations: replace (one character
is replaced by another), insert (one symbol is inserted), and delete (one symbol
is deleted).

The algorithm has been presented in [Sel80, Ukk85], but not as a simulation
of the run of the NFA. The NFA for the approximate string matching, shown in
Figure 4, has been presented in [Mel96, Hol96]; it has $m + 1$ depths (columns)
and $k + 1$ levels (rows), one level of states for each edit distance less than or equal
to $k$. The way in which the dynamic programming simulates the run of the NFA is
shown in [Mel96, Hol98a].

In the dynamic programming there is one integer variable for each depth of the
NFA; it contains the lowest level number of an active state in this depth, or a
value greater than or equal to $k + 1$ if there is no active state in this depth. The
formula for computing the vector $D_i$, $0 \leq i \leq n$, of the integer variables is as
follows:

$$
\begin{aligned}
d_{j,0} &:= j, && 0 \leq j \leq m \\
d_{0,i} &:= 0, && 0 \leq i \leq n \\
d_{j,i} &:= \text{if } t_i = p_j \text{ then } d_{j-1,i-1} \\
&\quad\ \text{else } \min(d_{j-1,i-1} + 1,\ d_{j,i-1} + 1,\ d_{j-1,i} + 1), && 0 < i \leq n,\ 0 < j \leq m
\end{aligned}
\qquad (1)
$$

Fig. 4. NFA for the approximate string matching using the Levenshtein distance
($m = 4$, $k = 2$).

^ Symbol $\bar{p}$ in the figure represents $\Sigma - \{p\}$.

The dynamic programming algorithm has been designed to compute the edit
distances of the prefixes of pattern $P$: the Levenshtein edit distance between
the string ending at position $i$ in text $T$ and the prefix of length $j$ of pattern
$P$ is $d_{j,i}$. The vector of the edit distances therefore also describes the
active states of the NFA.

This simulation runs in time $O(mn)$ and needs space $O(m)$. There are also
some optimizations of [Sel80, Ukk85] that differ only in the implementation of
the matrix $D$, e.g. KTPm with time $O(kn)$.
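
Formula (1) can be evaluated column by column, which gives the $O(mn)$ time and $O(m)$ space stated above. The following Python sketch is an illustration only; the function name `approx_search` and the choice to return the 1-based end positions with $d_{m,i} \leq k$ are assumptions of this example.

```python
def approx_search(pattern, text, k):
    """End positions i (1-based) in `text` where `pattern` matches with <= k edits."""
    m = len(pattern)
    column = list(range(m + 1))       # d_{j,0} = j
    hits = []
    for i, t in enumerate(text, start=1):
        diagonal = column[0]          # d_{0,i-1}
        column[0] = 0                 # d_{0,i} = 0
        for j in range(1, m + 1):
            previous = column[j]      # d_{j,i-1}
            if t == pattern[j - 1]:
                column[j] = diagonal  # d_{j-1,i-1}
            else:
                column[j] = 1 + min(diagonal, previous, column[j - 1])
            diagonal = previous
        if column[m] <= k:
            hits.append(i)
    return hits

print(approx_search("abc", "xabcx", 0))   # [4]
print(approx_search("abc", "xabcx", 1))   # [3, 4, 5]
```

In terms of the NFA of Figure 4, `column[j] <= l` means that the state in depth $j$ and level $l$ is active after reading $t_i$.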

In [Hol98b] we have designed another simulation based on the dynamic
programming. We use one integer variable for all the states located on the same
diagonal, because if there is an active state on the diagonal, then all the states
located lower on the same diagonal are also active due to the $\varepsilon$-transitions.
Therefore for each diagonal we store the level of the highest active state on the
diagonal. This simulation technique is used when we are interested in
all the occurrences of the pattern with at most $k$ errors but we do not want
to know the number of errors in the found strings. In this case we can remove
the states located on the diagonals that have fewer than $k + 1$ states, because these
states are needed only to determine the number of errors in the found string.
This simulation runs in time $O((m - k)n)$.



4 Bit Parallelism 

The bit parallelism is very similar to the naive method implemented by the bit
vectors, but in this method we do not need the transition table. The set $Q$ of
states of the NFA is divided into subsets $Q_1, Q_2, \ldots, Q_l$ such that:

(1) $Q = \bigcup_{i=1}^{l} Q_i$,

(2) $Q_i \cap Q_j = \emptyset$, $\forall i, j$, $1 \leq i \leq l$, $1 \leq j \leq l$, $i \neq j$.



Each such subset $Q_i$ is implemented by a bit vector in which each bit
represents a state $q \in Q_i$ (0 represents that state $q$ is active). The transition
table in this technique is represented by the formula for the computation of
the vectors, in which groups of transitions are computed at once (in parallel;
therefore this technique is called the bit parallelism).

As an example of this technique we show the Shift-Or algorithm [BYG92] for the
approximate string matching using the Levenshtein distance (there are also the
modifications Shift-Add [BYG92] and Shift-And [WM92])^. In [Hol96, Hol97] we
have shown that the Shift-Or algorithm simulates the run of the NFA for the
approximate string matching shown in Figure 4.
In this algorithm there is one bit vector $R^j$ for each level $j$, $0 \leq j \leq k$, of the
states of the NFA. To implement the horizontal transitions the whole vector is shifted
to the right by using the bitwise operation shl, which inserts 0 at the beginning
of the vector in order to implement the self-loop of the initial state. Then the
mask vector is used in order to select only such transitions that correspond to
the input symbol. It is performed by the bitwise operation OR, which sets the active
states in the positions not corresponding to the input symbol to inactive
states.

The implementation of the diagonal transitions is performed by adding the
shifted vector of level $j$, $0 \leq j < k$, to the new value of the vector of level $j + 1$.
This adding is performed by the bitwise operation AND, which inserts the active states
(represented by 0) into the positions in which there are inactive states. The
$\varepsilon$-transitions and the vertical transitions are implemented in a similar way.

^ In the implementation of the Shift-Or algorithm the vectors are shifted to the left,
which is more suitable for the case that the vector is longer than the computer word
used and has to be divided into two or more computer words.
The formula for computing the vectors is as follows:

$$
\begin{aligned}
r^l_{j,0} &:= 0, && 0 < j \leq l,\ 0 < l \leq k \\
r^l_{j,0} &:= 1, && l < j \leq m,\ 0 \leq l \leq k \\
R^0_i &:= \mathrm{shl}(R^0_{i-1}) \ \mathrm{OR}\ D[t_i], && 0 < i \leq n \\
R^l_i &:= (\mathrm{shl}(R^l_{i-1}) \ \mathrm{OR}\ D[t_i]) \\
&\quad\ \mathrm{AND}\ \mathrm{shl}(R^{l-1}_{i-1} \ \mathrm{AND}\ R^{l-1}_i) \ \mathrm{AND}\ R^{l-1}_{i-1}, && 0 < i \leq n,\ 0 < l \leq k
\end{aligned}
\qquad (2)
$$

where each element $d_{j,x}$, $0 < j \leq m$, $x \in \Sigma$, contains 0 if $p_j = x$, or 1
otherwise.

This simulation technique was also used for the NFAs for the approximate
sequence matching [Hol97] using the Hamming, Levenshtein, and generalized
Levenshtein distances.
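
Formula (2) can be sketched in Python with integers as bit vectors. The layout is an assumption of this example: bit $j - 1$ of an integer (counted from the least significant bit) stands for depth $j$, so shl becomes a left shift that inserts 0 at the lowest bit, and the names `shift_or_approx`, `masks` and `levels` are hypothetical. A bit value 0 means "active", as in the text.

```python
def shift_or_approx(pattern, text, k):
    """End positions i (1-based) in `text` where `pattern` matches with <= k edits."""
    m = len(pattern)
    full = (1 << m) - 1

    def shl(v):
        return (v << 1) & full        # inserts 0 (an active state) at depth 1

    masks = {}                        # D[x]: bit j-1 is 0 iff p_j == x
    for j, p in enumerate(pattern):
        masks[p] = masks.get(p, full) & ~(1 << j)

    # R^l_0: depths 1..l are active (0), all other depths are inactive (1).
    levels = [full & ~((1 << l) - 1) for l in range(k + 1)]
    hits = []
    for i, t in enumerate(text, start=1):
        mask = masks.get(t, full)
        old = levels[:]               # the vectors R^l_{i-1}
        levels[0] = shl(old[0]) | mask
        for l in range(1, k + 1):
            levels[l] = ((shl(old[l]) | mask)
                         & shl(old[l - 1] & levels[l - 1])
                         & old[l - 1])
        if not (levels[k] >> (m - 1)) & 1:   # depth m is active on level k
            hits.append(i)
    return hits

print(shift_or_approx("abc", "xabcx", 0))   # [4]
print(shift_or_approx("abc", "xabcx", 1))   # [3, 4, 5]
```

Each text symbol costs $O(k)$ word operations as long as $m$ fits into one machine word; Python integers hide the multi-word case mentioned in the footnote above.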



5 Conclusion 

We have presented two simulation techniques for NFA. They can be used when
the transformation of the NFA to the equivalent DFA cannot be used
because of its space complexity. They can also be used when the input
text is not very long and the time needed for the transformation of the NFA to the
equivalent DFA together with the time for the run of the DFA is greater than the
time of some simulation of the NFA.
Abstract. This paper presents an adjustable method of syntactic prediction
based on the left context, using token automata. Our purpose is to
predict the end of a word from its first typed letters. We
illustrate our approach with a prototype communication aid for
disabled users, called HandiAS.



1 Motivation 

This paper presents an adjustable method of syntactic prediction based on the
left context, using token automata. Our purpose is to predict the end of a word
from its first typed letters. Is it possible? This goal would be
practically unattainable unless a syntactic study allows us to guess the grammatical
category of the word and unless words are restricted to a common vocabulary.

We illustrate our approach with a prototype communication aid for
disabled users, called HandiAS [3|. It is a hybrid system, both
symbolic and statistical. The symbolic part is based on the notions of sentence
schema and acceptability, introduced by Z. S. Harris |3|. A sentence
schema is easily represented by a finite-state automaton P! H2i The statistical
part is based on different studies about word tokens P P and on the notion
of token schema that we define.

Figure 1 presents the interface of HandiAS: the user writes a text with a
virtual keyboard and the system suggests a list of five words after each keystroke.

First, we present a definition of a schema (Section 2); second, we describe our
data base: the dictionary and the acceptability tables (Section 3); third, we explain how
the system works (Section 4), evolves and adapts to the user. We conclude with some
results and prospects.

[Figure: screenshot of the interface, with the list of five words, the virtual keyboard and the writing text]

Fig. 1. The interface of HandiAS

2 Automata and Schemata 

Definition 1 The finite state machine $(Q, L, q_0, F, \delta, \lambda, a)$ where:

- $L$ is a set of grammatical categories

- $(Q, L, q_0, F, \delta)$ is a deterministic finite state automaton

- $\lambda$ is a function: $Q \times L \to \mathbb{N}$ (transition token function)

- $a$ is a function: $F \to \mathbb{N}$ (final state token function)

is named a token schema^.

The HandiAS system includes three schemata: we also drew our inspiration
from the representation of F. Debili |2|, who treats some conjunctions,
punctuation marks or prepositions as breaks. Therefore, our three schemata represent
sentences, noun phrases and verb phrases, respectively.

For instance, assume that one wants to write sentences such as: Tu aimes
beaucoup la musique. (You like music a lot.), Jean aime jouer de la musique.
(John likes playing music.), Aimes-tu la musique? (Do you like music?) or
Aimes-tu jouer de la musique? (Do you like playing music?). The schema of
Figure 2 makes it possible to recognize these four sentences. Of course, our true
schema is more complex: 107 states and 164 transitions. The schema shows that there are
just three possibilities to begin a sentence and it gives their tokens 170, 250 and
42 (these tokens are made-up numbers for this paper).

^ A token schema is not a string-to-weight transducer, as in [llj: we cannot minimize
it, and it may not have convergent states.

This automaton is not fixed; it changes over time. The transitions labeled
by NC or VC refer to the two other automata, the noun phrase schema and the
verb phrase schema. Figure 3 and Figure 4 present two very simplified schemata,
which are sufficient for our example (in reality, the noun phrase schema has 205 states
and 289 transitions and the verb phrase schema has 34 states and 70 transitions).

Fig. 2. An example of sentence schema



3 The Data Base 

3.1 The Dictionary 

Our dictionary is based on two dictionaries:

- The Juilland dictionary g], which is made up of about 18 000 inflected words
(5 083 lemmas), covering 92.43% of words in a text;

- The Catach dictionary [H, which is roughly a subset of the previous one and is
made up of only 4 000 inflected words (1 620 lemmas), yet covers 90.51%
of words in a text.
Fig. 3. An example of noun phrase schema

Fig. 4. An example of verb phrase schema



Of course, the remaining 10% of errors represent about two unpredicted
words per sentence, which is a large quantity. So this base dictionary will be
extended with new words, adapted to the user (see Section 5).

The main role of the dictionary is to manage the tokens of the user's current
words. We store two kinds of tokens: the token of the lemma and the tokens of its
inflections. This compensates for the low frequency of some inflections of current
words: first, the token actually used is that of the lemma; second, we draw a
distinction between the different inflections.



With the same example, assume that we are looking for the word aime (like).
Table 1 presents the tokens of all its inflections that are in our dictionary and
the token of the lemma (the sum of the tokens of its inflections).
Inflections   aime       Present, 1s      110
              aime       Present, 3s       57
              aimer      Infinitive        42
              aimait     Imperfect, 3s     31
              ...
              aimeront   Future, 3p         1
Lemma         aimer                       398

Table 1. Extract of dictionary



3.2 The Acceptability Tables 

Definition 2 An acceptability table is a binary matrix M that describes which
successions of two terms are acceptable:

— if the succession of the term of row i and the term of column j is
acceptable:

M[i,j] = +

— else:

M[i,j] = -

The HandiAS system includes three acceptability tables, one for every occurrence
schema. They tell us the possible successions between the different
syntactical categories.

Still with the same example, Table 2 presents an acceptability table for the
categories recognized by the noun phrase schema of Figure 3. One can read in this
table:

— a proper noun can begin or end a noun phrase

— a determiner can begin but not end a noun phrase and can be followed
by a noun or an adjective
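
As an illustration, the acceptability table of the noun phrase schema can be sketched as a small lookup structure. The Python below is only a sketch under stated assumptions: the names ACCEPT, can_follow and is_acceptable are hypothetical and do not come from HandiAS, and each row simply lists the columns marked "+" in Table 2.

```python
# Illustrative encoding of the acceptability table (Table 2).
# Each row is a category; the set holds every column marked "+".
# BEGIN and END stand for the "beginning of phrase" and "end of phrase" columns.
BEGIN, END = "Begin", "End"

ACCEPT = {
    "Proper noun": {BEGIN, END},
    "Det":         {BEGIN, "Noun", "Adj"},
    "Noun":        {"Noun", "Adj", END},
    "Adj":         {"Noun", "Adj", END},
}


def can_follow(first, second):
    """True if category `second` may directly follow category `first`."""
    return second in ACCEPT[first]


def is_acceptable(categories):
    """Check a whole sequence of categories against the table."""
    if not categories:
        return False
    if BEGIN not in ACCEPT[categories[0]] or END not in ACCEPT[categories[-1]]:
        return False
    return all(can_follow(a, b) for a, b in zip(categories, categories[1:]))


if __name__ == "__main__":
    print(is_acceptable(["Det", "Adj", "Noun"]))  # determiner, adjective, noun
    print(is_acceptable(["Det"]))                 # a determiner cannot end a phrase
```

Storing only the "+" cells keeps the lookup to a set membership test; a dense Boolean matrix would serve equally well.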



4 The Working of the System 



Now, we explain how the system works with the beginning of the same sentence: Tu
aimes beaucoup la musique. (You like music a lot.)
              Beginning   Proper                        End
              of phrase   noun     Det   Noun   Adj     of phrase
Proper Noun       +         -       -     -      -         +
Determiner        +         -       -     +      +         -
Noun              -         -       -     +      +         +
Adjective         -         -       -     +      +         +

Table 2. An acceptability table



4.1 Before the First Letter 

Before writing, we are at state 0 of the sentence schema with three tokens: 

$A(0, NC) = 170$, $A(0, SubjPro) = 250$, $A(0, VC) = 42$

and, of course:

$\sum_{l \in L} A(0, l) = 462$

and, at state 0 of the noun phrase schema and the verb phrase schema:

$A(0, PN) = 67$, $A(0, Det) = 532$, $\sum_{l \in L} A(0, l) = 599$

$A(0, V) = 306$, $\sum_{l \in L} A(0, l) = 306$

and, thus, we compute four probabilities of syntactical categories:

$P_{state_0}(NC \wedge Det) = 170/462 \times 532/599 \approx 0.33$
$P_{state_0}(NC \wedge PN) = 170/462 \times 67/599 \approx 0.04$
$P_{state_0}(SubjPro) = 250/462 \approx 0.54$
$P_{state_0}(VC \wedge V) = 42/462 \approx 0.09$

Next, we consult the dictionary and compute the probabilities of lemmas
(Table 3).

And we put forward a list of five words: le, la, les, l' (lemma le, the), il
(lemma il, he).
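
The computation above can be reproduced with a few lines of Python. This is a minimal sketch, not the HandiAS code: the function and variable names are assumptions, and the token values are those of the running example and of Table 3. Exact fractions are used, so the last digit may differ slightly from the figures in Table 3, which appear to be computed from already rounded category probabilities.

```python
from fractions import Fraction

# Tokens on the transitions leaving state 0 (values from the running example).
sentence = {"NC": 170, "SubjPro": 250, "VC": 42}
noun_phrase = {"PN": 67, "Det": 532}
verb_phrase = {"V": 306}

# Category token and lemma tokens, as listed in Table 3.
dictionary = {
    "PN": (55, {"Jean": 17}),
    "Det": (54074, {"le": 38585, "un": 13839}),
    "SubjPro": (24481, {"il": 10851, "je": 8774, "tu": 4856}),
    "V": (63547, {"être": 13190}),
}


def normalize(tokens):
    """Turn raw tokens into probabilities by dividing by their sum."""
    total = sum(tokens.values())
    return {label: Fraction(count, total) for label, count in tokens.items()}


def category_probabilities():
    """Probability of each syntactical category at the start of a sentence."""
    s = normalize(sentence)
    probs = {"SubjPro": s["SubjPro"]}
    for cat, p in normalize(noun_phrase).items():
        probs[cat] = s["NC"] * p   # enter the noun phrase schema, then read cat
    for cat, p in normalize(verb_phrase).items():
        probs[cat] = s["VC"] * p   # enter the verb phrase schema, then read cat
    return probs


def lemma_probabilities():
    """Probability of each lemma: share within its category times P(category)."""
    cats = category_probabilities()
    return {
        lemma: Fraction(tok, cat_tok) * cats[cat]
        for cat, (cat_tok, lemmas) in dictionary.items()
        for lemma, tok in lemmas.items()
    }


if __name__ == "__main__":
    ranked = sorted(lemma_probabilities().items(), key=lambda kv: -kv[1])
    for lemma, p in ranked:
        print(f"{lemma:5s} {float(p):.3f}")
```

Note that this sketch ranks lemmas only; the suggestion list shown to the user contains inflected words, which requires the inflection tokens as well.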



4.2 After the First Letter 

When the user writes the first letter, t, we consider only the words that begin
with this letter. So, we put forward a new list of five words: tu, t' (lemma tu, you),
ton, ta, tes (lemma ton, your).

As soon as the user chooses the word tu, we write it on the text screen, followed
by a space. Thus, the user has performed two actions (clicking on t and tu) instead
of three (clicking on t, u and space). This is not a bad result, because this word
has very few letters. We will see in Section 6 that HandiAS writes a word after two or
three letters.
Syntactical   Token    Lemma   Token    P_state0(Phrase ∧ Cat ∧ Lemma)
category
PN               55    Jean       17    17/55 * P_state0(NC ∧ PN) ≈ 0.01
Det           54074    le      38585    38585/54074 * P_state0(NC ∧ Det) ≈ 0.24
                       un      13839    13839/54074 * P_state0(NC ∧ Det) ≈ 0.08
SubjPro       24481    il      10851    10851/24481 * P_state0(SubjPro) ≈ 0.24
                       je       8774    8774/24481 * P_state0(SubjPro) ≈ 0.19
                       tu       4856    4856/24481 * P_state0(SubjPro) ≈ 0.11
V             63547    être    13190    13190/63547 * P_state0(VC ∧ V) ≈ 0.02

Table 3. Lemma probabilities



5 Evolution and Adaptation 

The most important features of the HandiAS system are its evolution and adaptation
to the user's vocabulary and syntax. Without this evolution and adaptation,
the data of HandiAS (dictionary and token schemata) are too poor to be efficient.
But we tailor the software to the individual: the user's regular vocabulary
and syntax gradually complete the data. Of course, typing
a known word (in the dictionary or in the new word list) automatically
increments its token. The tokens on the token schemata are updated in the same way.

Now, we explain how we deal with new words, for instance rap, in
the sentence Jean aime le rap. (John loves rap.) This word is not in the
dictionary. We are at state 2 of the noun phrase schema (Figure 3) and we read
two transitions:

$A(2, N) = 351$, $A(2, Adj) = 181$, $\sum_{l \in L} A(2, l) = 532$

With two probabilities:

$P_{state_2}(N) = 351/532 \approx 0.66$
$P_{state_2}(Adj) = 181/532 \approx 0.34$

So, we suppose that rap is probably a noun, or perhaps an adjective, and nothing
else, as we see in the acceptability table (Table 2). The token of the word
rap later evolves like that of the other words of the dictionary. However, during the
current session, this token is increased, so that a rare word already used in the text
may reappear in the suggestion lists.

At the end of the current session, HandiAS proposes its hypothesis about the
word rap to the user. After validation, the word is added to the dictionary. If the
user does not intend to use it again, he can refuse this addition.
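
The hypothesis made for an unknown word can be sketched as follows. This is an illustrative sketch only: guess_categories and its parameters are hypothetical names, the transition tokens are those read at state 2 in the example, and the set of allowed categories stands for what the acceptability table permits after a determiner.

```python
def guess_categories(transitions, allowed):
    """Rank the candidate categories of an unknown word.

    transitions: tokens on the transitions leaving the current state.
    allowed: categories that the acceptability table permits at this point.
    """
    candidates = {cat: tok for cat, tok in transitions.items() if cat in allowed}
    total = sum(candidates.values())
    ranked = [(cat, tok / total) for cat, tok in candidates.items()]
    return sorted(ranked, key=lambda pair: -pair[1])


if __name__ == "__main__":
    # State 2 of the noun phrase schema, reached after the determiner "le".
    for category, probability in guess_categories({"N": 351, "Adj": 181}, {"N", "Adj"}):
        print(f"{category}: {probability:.2f}")
```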
The syntax evolves too. Assume that one wants to write Où as-tu appris le
rap? (Where did you learn rap?). This sentence begins with a coordinating conjunction,
continues with an auxiliary verb (avoir), a subject pronoun, a past
participle and a noun phrase, and ends with a question mark. After checking
that this phrase structure is acceptable in an acceptability table, the sentence
schema (Figure 2) will be modified as in Figure 5.



Fig. 5. Modification of sentence schema



6 Results and Prospects 



Today, the HandiAS system is being implemented in C++ under Windows NT.
A software company^ distributes with the Handimousse package a
free version of the HandiAS interface without suggestions, with just the virtual
keyboard, and announces the release of the full HandiAS system for next January.

^ C Technologie, 2 rue Marie-Curie, F44470 Carquefou, France

However, we ran tests on a corpus before the beginning of this
new implementation. Table 4 gives the results of these tests. The total number
of actions must be compared with the 10 866 characters of the corpus that
we used for the tests. With these three tests, we note that the HandiAS
system (dictionary and token schemata) gives better results than a system based
on a token dictionary alone or on token schemata alone.





                          Only with    Only with    HandiAS    Without
                          token        schema                  it
                          dictionary   system
Actions   Selections        3 680        3 653       3 345          0
          Keyboarding       3 290        2 953       2 791     10 866
TOTAL                       6 970        6 606       6 136     10 866
% of doing actions          64.15        60.80       56.47        100
% of saved actions          35.85        39.20       43.53          0

Table 4. Test results
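
The percentages in Table 4 follow directly from the action counts and the corpus size. The short Python check below is a sketch; the function name summarize is an assumption, and the figures are those reported in the table.

```python
CORPUS_CHARS = 10_866  # characters in the test corpus

# (selections, keyboarding actions) for each configuration in Table 4.
RESULTS = {
    "token dictionary only": (3680, 3290),
    "schema system only": (3653, 2953),
    "HandiAS": (3345, 2791),
}


def summarize(selections, keyboarding, baseline=CORPUS_CHARS):
    """Return (total actions, % of actions done, % of actions saved)."""
    total = selections + keyboarding
    done = 100 * total / baseline
    return total, round(done, 2), round(100 - done, 2)


if __name__ == "__main__":
    for name, (selections, keyboarding) in RESULTS.items():
        total, done, saved = summarize(selections, keyboarding)
        print(f"{name:22s} total={total} done={done:.2f}% saved={saved:.2f}%")
```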



Another important measure is the number of letters typed before a word is
written. Table 5 shows that our system is again more efficient than a token
dictionary alone or a schema system alone.



Only with token dictionary   Only with schema system   HandiAS
           3.13                       2.90                2.61

Table 5. Number of keyboarding letters to write a word



After the completion of this prototype, we plan to adapt our system to other
languages, because the implementation of HandiAS is independent of the language.
We only need to build another linguistic database.
Abstract. This paper presents an experiment in developing natural-language
tools to improve the quality of documents. These tools
use finite-state automata enriched with notions of proximity, optionality
and contextual information. They are called bi-directional because
they need to parse a sequence not only from the left to the right-hand side of
a sentence, but on both sides of a word. This method improves efficiency.
1 Introduction: Analyzing Corporate Documents

Every document issued by a company is intended to produce a reaction. For
this reason, the author and those associated with its production are implicated
in the document sent out. In doing so the company (depending on its activity)
runs risks which range from wasted time to serious failures. It can also
mean financial loss and major dysfunction, for example, in the sending out
of costing and technical specifications^.



^ This work has been done while the author was working in NEMESIA (Ivry-sur-Seine).
Information about NEMESIA can be found at the following address:
http://www.nemesia.com. NTK.FOCUS is a trademark from NEMESIA and can be
downloaded and tested in different versions at the above address. I would like to
thank people from NEMESIA for their constant friendship. I want to thank Adeline
Nazarenko (LIPN), Gregory Perrat (NEMESIA), Marie-Paule Pery-Woodley (ERSS)
and two anonymous reviewers from WIA’98 for their comments and suggestions on
earlier versions of this paper.

^ Most of this work has been done in close collaboration with people working for the 
nuclear industry. In this domain, the risk factor must be strictly controlled, especially 
the documentation. 



J.-M. Champarnaud, D. Maurel, D. Ziadi (Eds.): WIA’98, LNCS 1660, pp. 1999- 

(c) Springer- Verlag Berlin Heidelberg 1999 
To be efficient, textual documents must then satisfy quality criteria and respect
pre-determined stylistic norms. Moreover, they should use standardized
and systematic vocabulary, refer to the dictionary of existing terms, and avoid the
use of fuzzy or imprecise terms and the overuse of synonyms. Spelling and grammar
checkers exist, but stylistic and content checkers are lacking.

Our aim is then to develop small tools integrated in the working environment 
of experts. Our point of view is that, as far as is possible, the verification of qual- 
ity criteria must be adapted to the rereading of documents, thus facilitating the 
proof-reading and the annotation of the documents analyzed. 

We present here a versatile formalism to represent complex expressions in 
texts, including morphological variations, word reordering and surface transfor- 
mations. This formalism is intended to be integrated in tools to control various 
criteria concerned with fuzzy expressions, requirement analysis and terminolog- 
ical norms. For different reasons that will be discussed, we chose to implement 
enriched automata, that is to say automata able to reflect notions like proxim- 
ity, optionality, etc. They permit an efficient analysis of texts and increase the 
quality of documents. 

2 The Risk Factor in Corporate Documents 

The introduction of new document technology (e.g. groupware, workflow, Intranet,
...) requires a preliminary analysis of the risks inherent in the document.

2.1 Controlling the Risk Factor 

This task is achieved by: 

— identifying documents sent out or processed by the company and the
accompanying risks,

— identifying the authors and the readers of the documents, 

— analyzing the distribution plan for the document, 

— identifying the criteria to be applied to different types of documents depend- 
ing on their inherent risks. 

These criteria, defined in the context of major risk, are flexible enough to be 
adapted to all possible risks. 

We are particularly concerned with project documentation, which becomes
overwhelming as teams spend more and more time writing it up. This is the
only information which survives the projects. However, this documentation is
never written with a view to future use. Defining quality control procedures for
project documents is achieved by identifying the following:

— the types of documents depending on the phases of the project and their
future uses,

— the writers, their availability at each stage and their responsibility within
the project,

— quality criteria to be applied to different types of documents.

A project, whatever it may be, lives and dies according to its documentation. 
To guarantee quality is to remove one of the main sources of incoherence and 
misunderstanding harmful to good communication between those working on 
the project. The means of validating technical documents, which accompany a 
technical project, principally concern those documents which are standardized or 
partially standardized. When it comes to validating non-standardized language, 
which is its weak point, analytical techniques become subjective and intuitive, 
which is the direct opposite to the demands of quality. 

We developed a method called AVIS to analyze technical texts and identify their
inherent risks. This method was partially implemented in the AVISO toolbox
(a large Lisp program running under UNIX). We are now extracting some parts of
this toolbox and re-implementing them as small tools devoted to specific
tasks and integrated in the standard working environment of documentation
experts.

2.2 NTK.FOCUS: A Tool to Search Fuzzy Expressions in Texts

We developed a tool named NTK.FOCUS to search for fuzzy expressions in technical
texts. A fuzzy expression is an expression that creates doubt about
the content of a text. This doubt is revealed by written terms which leave too
much open to interpretation or which seem to be incomplete or imprecise, and
this may well lead to an incorrect understanding of the text. Information which
is too vague amounts to non-information.

A detailed linguistic study of this phenomenon has enabled us to determine
different types of fuzziness. In order to detect fuzziness, five types of linguistic
markers have been identified, and these point out^:

— a missing element (e.g. in an enumeration): etc.

— an element which leads to uncertainty and gives no further information:
peut-être

— an element which leads to uncertainty while giving more information:
probablement

— an element which is fuzzy in a given context: plus (fuzzy in il faut
plus de sodium but not in il ne faut plus de sodium).

— a fuzzy element in relation to a given norm or field: normal (when the
normal/abnormal characteristics of the event in question have not been specified).

® This work is fully reported in M- 
^ Tr. (peut-être): maybe.

^ Tr. (probablement): probably.

^ Tr. (il faut plus de sodium): more sodium is needed.

^ Tr. (il ne faut plus de sodium): no more sodium is needed.
The development of our dictionaries is cyclic. At the beginning, a collaboration
between experts and linguists makes it possible to establish a first set of fuzzy
expressions. A first dictionary is then available and validated on real texts. The
results are analyzed in terms of noise (expressions that are recognized but should
not be) and silence (expressions that should be recognized but are not).
Dictionaries are then corrected and validated on a new corpus, and so on. This work
is defined on a linguistic basis, and our methodology avoids the problems of purely
introspective systems. Resulting texts must avoid ambiguity, respect industrial
standards and be easily translated into formal systems^. NTK.FOCUS works in the
same way as other tools integrated in word processors^. It is interactive because
it can be used while working on the document: it enables the user to make corrections
and the re-reader to insert footnotes. It is ultimately up to the writer (or the
re-reader) to judge whether the expression in the text is acceptable or not. He
can correct it, add notes or ignore it.

This tool has been developed in close collaboration with re-readers of technical
documents, in particular experts from the energy industry. Every document
concerning nuclear energy must undergo several verification stages. The elimination
of fuzzy expressions is one of the first stages in the verification cycle.
NTK.FOCUS is currently used, among others, by experts from the French Institute
of Protection and Nuclear Safety (IPSN, Institut de Protection et de Securite
Nucleaire, Fontenay-aux-Roses, France) and from the French energy supplier
EDF (Electricite de France, Clamart).

3 Implementing Bi-directional Automata 

Our aim in this context is to build a versatile formalism to represent complex 
expressions in texts including morphological variations, word reordering and 
surface transformations. This formalism is intended to be integrated in tools to 
control various criteria concerned with fuzzy expressions, requirement analysis 
and terminological norms. For reasons of efficiency and disposability, we exclude
any pre-processing or tagging of the text. Automata are a good formalism for
our purpose: although automata do not offer a deep analysis of texts,
they are very efficient and robust.

3.1 Morphological Variations of Words 

We separate each word into a stem, a prefix and a suffix. A suffix family corresponds
to the variation set of a stem. We could have taken into consideration
some etymological or phonological criterion to break down words. For example,
let's consider the French stem possible (Figure 1), which can only be suffixed
by (ε + s + ment)^.

^ Experiments have been done at IPSN to translate natural language descriptions of
complex systems into K.O.D. (Knowledge Oriented Design) or into the formal B
language [2]. This means that the texts must not be too open to interpretation,
incomplete or imprecise.

^ NTK.FOCUS is fully integrated in Microsoft Word 6, 7 and 8 (Microsoft Word 97).
Dictionaries for French, English and German have been developed.



Fig. 1. Morphological variation of the stem possible. The entry point is state
0; exit points are grayed (states 0 or 1)



Now, if we take a look at the stem certain (Figure 2), we can see that it can
be suffixed by (ε + s + e + es + ement). We could say that the suffix family
of possible is the same as that of certain. The variation is only due to the
fact that possible has a vowel stem while certain has a consonant one: the -e- is
then analyzed as an epenthetic vowel appearing before consonant suffixes:



Fig. 2. Morphological variation of the stem certain (1) (automaton labeled SUFFIX 2a)



But our aim is more pragmatic: it is easier and more efficient to represent a
suffix family by an automaton composed of a list of endings instead of a composition
of automata (Figure 3). This automaton represents the suffix family
of certain and is strictly equivalent to the one in Figure 2.



Fig. 3. Morphological variation of the stem certain (2) (automaton labeled SUFFIX 2b)



In Figure 3, one can notice that state 0 is the entry point of the automaton
but is also a final state: the stem can stay without any suffix. Finally, we add
a tag on the suffix family to say whether it is optional or not. The automaton is then
equivalent to a simple list of suffixes (Figure 4).

^ ε represents an empty string.



certain:   lex='certain'
           suffix=suffix1
           optional=yes

suffix1:   num=4
           lex=('e', 's', 'es', 'ement')

Fig. 4. Morphological variation of the stem certain (3)
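
The list-based representation of Figure 4 can be sketched in a few lines of Python. This is an illustration under assumptions, not the NTK.FOCUS implementation: the Entry class and the matches function are hypothetical names, and the optional flag is read here as "the bare stem is itself a valid form".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    lex: str          # the stem
    suffixes: tuple   # the suffix family, as a plain list of endings
    optional: bool    # True if the stem may appear without any suffix


def matches(entry, word):
    """True if `word` is the stem followed by one ending of its suffix family."""
    if not word.startswith(entry.lex):
        return False
    ending = word[len(entry.lex):]
    if ending == "":
        return entry.optional
    return ending in entry.suffixes


# The entry of Figure 4.
CERTAIN = Entry(lex="certain", suffixes=("e", "s", "es", "ement"), optional=True)

if __name__ == "__main__":
    for form in ("certain", "certaines", "certainement", "certainment"):
        print(form, matches(CERTAIN, form))
```

Because every stem starts at the left edge of the word, entries can be indexed by stem, for instance in a hash table, before the endings are checked.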



The same is done with prefixes. However, for reasons of efficiency, we generally
prefer to duplicate the entry in order to have only empty prefixes^. Languages
such as German raise another problem because they have verbs with particles that
can be detached in the sentence. For instance, ausgehen (to go out): ich gehe
mit meinem Vater aus (I am going out with my father), ich will mit meinem
Vater ausgehen (I want to go out with my father), ich bin mit meinem Vater
ausgegangen (I have been out with my father). When the stem is heavily altered,
as in the last example, it is often necessary to create a second entry for the
word. This problem is solved by the way we represent multi-word entries.

Although this is a simplified implementation of morphological variations,
dictionaries for French, German and English have been developed. Our aim was
to build an efficient and realistic way to implement morphology, not to describe
the very complexity of morphology. For a more complete implementation and 
some considerations on the problem, see m m- 



3.2 Word Combinations in Complex Expressions 

Classically, automata describe complex expressions in a single direction, from
the left to the right-hand side of a sentence. This is known to be very efficient
if you are looking for full words, because there is no backtracking during the
analysis. If you are looking for markers (such as complex determiners or
prepositional phrases), many wrong analyses will be initiated, because most of the
time markers begin with a preposition or a determiner, i.e. with words that are
among the most frequent in European languages.

To avoid this problem, we have defined in each automaton a word called the pivot.
The pivot is the word from which every other item of the expression is described. This
means we can search for a full word instead of an empty one such as a preposition.
We have defined rules to choose the pivot: it must be a full word and it is
generally the semantic head of the expression.

^ So that a stem always begins on the left-hand side of a word. Some optimizations
can then be applied such as a hash table.

Other items appearing in the description are described in terms of relative position
(a word is on the left or on the right of another one), of proximity (there
can be at most 1 or 2 words between the items described) and of optionality (a
multi-word expression can be recognized with or without certain words). The
description language makes it possible to say that a word M is on the left of a
word M’, that they can be separated by at most n words, and whether or not M is
optional in the expression. Finally, two operators have been added to the description
language: the OR operator (disjunction, if a position can be indifferently filled
by more than one lexical item) and the AND operator (conjunction, if several
lexical items are required on both sides of a word or if their ordering is free).

Let's take as an example the family of expressions built around the French
word peu, by analogy with the notion of tree family in Tree Adjoining Grammars.
Figure 5 lists prepositional phrases and complex determiners expressing
an idea of quantification.



Pivot   Expression
Peu     peu
Peu     quelque peu
Peu     de peu
Peu     peu à peu
Peu     à peu près
...

Fig. 5. Family of expressions built around peu (extract)



The description uses a bracketed formalism defined according to the aforementioned
constraints. A dictionary is a set of descriptions, each one being
equivalent to a bi-directional finite state automaton, enriched with information
on proximity and optionality (Figure 6). Although bi-directional automata are
not widely used, they are a well-known formalism. They make
it possible to look for words on both the left and the right-hand
side of a word at the same time. Our description language is relatively rich and
can offer several strategies to encode the same expression. However, some rules have
been defined to keep descriptions coherent and explicit. The dictionary is then
compiled for reasons of efficiency.

Herein, the dictionary is equivalent to a local grammar. However, the
concept of local grammar is not sufficient to describe some of the expressions
that must be recognized according to the experts (see above, part 2).
Fig. 6. A part of the bi-directional automaton corresponding to the pivot peu
(transition label: quelque<G,1>).
Apart from the pivot peu, each transition refers to a lexical item with indications
on its position (compared to the item from the preceding transition) and a
number referring to the proximity. The AND(à <G,1>, près <D,3>) expression
makes it possible to recognize “à peu près”, where the pivot “peu” is between “à” and
“près”. Proximity on the transition from “peu” to “près” is equal to 3 to be able
to recognize phrases such as “à peu de choses près” with insertions between “peu”
and “près”. This automaton is able to recognize all the expressions enumerated
above.
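
The idea of searching outward from a pivot can be sketched as follows. This is a simplified illustration, not the compiled automata of NTK.FOCUS: the Item class and the functions are hypothetical names, G and D are read as "left" and "right" (as the examples suggest), the proximity is taken as the maximum distance in tokens, and every distance is measured from the pivot rather than from the item of the preceding transition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    word: str
    side: str          # "G": left of the pivot, "D": right of the pivot
    proximity: int     # maximum distance from the pivot, in tokens
    optional: bool = False


def find_item(tokens, pivot_index, item):
    """Index of item.word within item.proximity tokens of the pivot, or None."""
    step = -1 if item.side == "G" else 1
    for distance in range(1, item.proximity + 1):
        i = pivot_index + step * distance
        if 0 <= i < len(tokens) and tokens[i] == item.word:
            return i
    return None


def match(tokens, pivot, items):
    """Return (first, last) token indices of each expression found around the pivot."""
    spans = []
    for p, token in enumerate(tokens):
        if token != pivot:
            continue  # only occurrences of the pivot start an analysis
        first = last = p
        recognized = True
        for item in items:  # all items are required unless marked optional
            i = find_item(tokens, p, item)
            if i is None:
                if not item.optional:
                    recognized = False
                    break
                continue
            first, last = min(first, i), max(last, i)
        if recognized:
            spans.append((first, last))
    return spans


# AND(à <G,1>, près <D,3>) around the pivot "peu".
A_PEU_PRES = [Item("à", "G", 1), Item("près", "D", 3)]

if __name__ == "__main__":
    sentence = "il reste à peu de choses près dix litres".split()
    print(match(sentence, "peu", A_PEU_PRES))
```

Starting from the pivot means that the frequent word “à” is never used as an entry point: it is only looked up once a “peu” has been found.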



4 Context Sensitive Automata 

Several tools have been developed with this formalism to increase corporate document
quality. But while we were developing NTK.FOCUS, it appeared that
the formalism had to be enriched with context and inhibiting information.

The formalism was thus enriched in order to take the context
into account, particularly the frames in which the expression loses its fuzzy
character (see the example above: Il faut plus de sodium vs. Il ne faut plus
de sodium). The description of the expression still conforms to what
has been described above. But words are encapsulated in another level of structure
depending on the context. A positive Boolean feature is associated with the
word if it expresses fuzziness, and a negative one when it reveals an inhibiting
context. This automaton is in fact a transducer generating a Boolean value. Our
representation reflects the way we implemented it: the Boolean result of the
parsing is calculated and percolated during the analysis. However, this is not
equivalent to a weighted transducer. A weighted transducer makes it possible to
increase the efficiency of parsing by the addition of heavy weights on the most common
paths. Our aim here is not to increase the efficiency of parsing, but to be able
to say that, in a certain context, an expression is no longer fuzzy and must not
be displayed in the interface (Figure 7).

To recognize a fuzzy expression, the analyzer must be on a final state of the
automaton with the positive feature active. If the negative feature is active, the
analysis fails. The only output produced by an automaton is generally a Boolean
value saying whether the segment of text has been recognized (parsed) or not.
Contextual information associated with Boolean values makes it possible to limit
the number of text segments appearing in the interface (to limit the noise) and to
increase the quality of the results. This has been shown through experiments with
experts from IPSN, who compared a first version of NTK.FOCUS with the present
version. According to them, the accuracy of the results is good and leads to a
better quality of the expert analysis of the document. A stack is used to parse
automata and manage backtracking.

Fig. 7. Enriched automata with notion of context. This is the same automaton
as in Figure 6, where each transition corresponding to a fuzzy expression is
tagged with a +. Conversely, the transition corresponding to “pour peu
que” (a locution meaning “if”) is tagged with a - because pour
peu que does not constitute a fuzzy expression (“pour peu qu'il soit malade...”,
Tr.: “If he is sick...”). Only the longest sequence is recognized by the
algorithm, to avoid recognizing “peu” in the expression “pour peu que”.

The analysis works by success/failure: the analyzer looks first for fuzzy
expressions, then for their negative contexts. When an expression is found with
the positive feature active, it is displayed in the interface. The analysis
fails if the automaton has been completely parsed and no expression has been
found with the positive feature active.
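
The role of the Boolean feature and of the longest-match rule can be illustrated with a small Python sketch. It is deliberately simplified and is not the NTK.FOCUS analyzer: the descriptions are stored as plain token sequences scanned from left to right (not as pivot-based automata), the names DESCRIPTIONS and analyze are assumptions, and the input is assumed to be lower-cased and already split into tokens.

```python
# Each description carries a Boolean feature:
#   True  (+): fuzzy expression, to be displayed in the interface;
#   False (-): inhibiting context, the expression must not be displayed.
DESCRIPTIONS = {
    ("peu",): True,
    ("quelque", "peu"): True,
    ("pour", "peu", "que"): False,
}


def analyze(tokens, descriptions=DESCRIPTIONS):
    """Return (start, end, text) for each fuzzy expression to display."""
    longest = max(len(key) for key in descriptions)
    hits = []
    i = 0
    while i < len(tokens):
        found = None
        # Try the longest candidate first, so that "pour peu que" wins over "peu".
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in descriptions:
                found = key
                break
        if found is None:
            i += 1
            continue
        if descriptions[found]:          # positive feature: report it
            hits.append((i, i + len(found), " ".join(found)))
        i += len(found)                  # negative feature: skip silently
    return hits


if __name__ == "__main__":
    print(analyze("il est quelque peu malade".split()))
    print(analyze("pour peu que il soit malade".split()))
```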



5 Further Developments 

We are now working on more complex redundant structures. For example, requirement
analysis needs to parse not only markers, but also modal verbs and
qualification adjectives. Modal verbs can occur with a lot of expressions and
cannot be manually integrated in each automaton. We have therefore developed
a pre-processing stage to expand our dictionary before compilation. This pre-processor
is a script using the C preprocessor and the programming language
Sed. At the moment, this tool is being used to develop new dictionaries concerning
requirement analysis. Such overlapping automata can lead to very large
entries (up to 300 lines and more for certain complex entries) that cannot be
processed manually.

During our experiments on corpora, it appeared that the notion of sentence
is not as clear as it seems. That is not without consequences, particularly for
requirement analysis. We added to the system a flexible text segmentation module
to be able to take lists and enumerations into account. It is then possible
to extract meronymic relations from texts; for example, objects in technical texts
are often described in terms of decomposition. One main task of the experts is
to see if the text respects a model of the activity described. This means building
models and knowledge bases semi-automatically and projecting texts onto these
models. Our current experiments are being carried out in this context at IPSN.

6 Conclusion 

We have presented in this paper finite-state automata enriched with notions of
proximity, optionality and contextual information. These automata are called
bi-directional because they need to parse a sequence not only from the left to the
right-hand side of a sentence, but on both sides of a word. This method improves
efficiency: the first element searched is most of the time a full word instead of a
preposition or a determiner.

We have also increased analysis precision by adding information about contextual 
inhibiting items. It is then possible to search an expression in a certain context 
but not in another one. Negative context is a good way to reduce noise in analysis 
results. Contextual information is completely integrated in the lexical description 
to ensure coherence. 

Lastly, we have tried to show that automata can be used for general purposes, 
particularly for tools concerned with natural language processing. They offer a 
well-known, efficient and versatile formalism that is a real alternative to all ad 
hoc formalisms developed for particular purposes. 
Abstract. We present a semi-incremental algorithm for constructing
minimal acyclic deterministic finite automata. Such automata are useful
for storing sets of words for spell-checking, among other applications.

The algorithm is semi-incremental because it maintains the automaton
in near-minimal condition and requires a final minimization step after
the last word has been added (during construction).

The algorithm derivation proceeds formally (with correctness arguments)
from two separate algorithms, one for minimization and one for adding
words to acyclic automata. The algorithms are derived in such a way
as to be combinable, yielding a semi-incremental one. In practice, the
algorithm is both easy to implement and displays good running time
performance.

1 Introduction 

In this paper, we present a semi-incremental algorithm for constructing min- 
imal acyclic deterministic finite automata (ADFAs). By their acyclic nature, 
they represent finite languages, and are therefore useful in applications such as 
storing words for spell-checking. In such applications, the automata can grow 
extremely large (with more than $10^6$ states), and are difficult to store without 
first applying a minimization procedure. In traditional minimization techniques, 
the unminimized ADFA is first constructed and then minimized. Unfortunately, 
the unminimized ADFA can be very large indeed — sometimes even too large to 
fit within the virtual address space of the host machine. As a result, incremental 
techniques for minimization (i.e. the ADFA is minimized during its construction) 
become interesting. Incremental algorithms frequently have some overhead: 
if the unminimized ADFA fits easily within physical memory, it is still faster 
to use nonincremental techniques. On the other hand, with very large ADFAs, 
the incremental techniques may be the only option. 



J.-M. Champarnaud, D. Maurel, D. Ziadi (Eds.): WIA’98, LNCS 1660, pp. 121-^221 1999- 
(c) Springer-Verlag Berlin Heidelberg 1999 

The algorithm presented in this paper is semi-incremental (as opposed 
to fully-incremental, or just incremental) because it maintains the 
ADFA in a nearly-minimal condition while words are added, but requires a 
simple ‘final’ step to achieve full minimality after all words have been added. 

In order to derive the algorithm, we proceed in three stages: 

1. Derive an efficient algorithm for minimizing ADFAs. 

2. Derive a simple algorithm for adding new words to an ADFA under certain 
conditions. 

3. Combine the algorithms derived in the first two steps. By design, the first 
two algorithms will manipulate the ADFA in a fashion that makes them 
combinable. 

A great deal of practical work on minimizing ADFAs has been done. 
Unfortunately, much of the research is of a proprietary nature and thus forms 
part of the folklore of computational linguistics. Several years ago, Revuz also 
derived incremental minimization algorithms [?]. The primary algorithm presented 
by Revuz uses a reverse ordering of the words to quickly compress the endings of 
the words within the dictionary. Further work by Revuz has yielded algorithms 
which correspond rather closely to the one in this paper. All minimization 
algorithms show strong similarities, as can be seen from the taxonomy in [?]. The 
subtle differences between the algorithms can lead to domain-specific performance 
advantages for each algorithm. More recently, a paper describing “An Incremental 
Algorithm for Constructing Acyclic Deterministic Transducers” was accepted to 
the Workshop on Finite State Machines in Natural Language Processing [?]; 
the current author is also one of the co-authors of that paper. The algorithm 
presented in that paper is entirely different from the one presented in this paper. 
In particular, the following differences are noteworthy: 

— The algorithm presented here is semi-incremental, whereas the two algo- 
rithms presented in the other paper are fully-incremental. 

— The algorithm presented here requires the words (to be added to the ADFA) 
to be in any order of decreasing length. One of the algorithms appearing 
in the other paper requires the words in lexicographical order while the other 
algorithm does not have an ordering requirement. 

— Thanks to the simple ordering requirement of the algorithm presented in 
this paper, the algorithm is both very simple (and thus easy to implement) 
and very fast. 

— Also, thanks to the simplicity, full correctness arguments are more readily 
provided for the algorithm in this paper. 

This paper is structured as follows: 

— Section 2 presents the necessary definitions of ADFAs. 

— Section 3 derives a procedure for minimizing an ADFA. 

— Section 4 derives a procedure for adding words to an ADFA. 

— Section 5 combines the algorithms from Sections 3 and 4 to yield the 
semi-incremental algorithm. 

— Section 6 provides some details on running time and implementation issues 
for the algorithm. 

— Section 7 presents the conclusions of this paper. 

2 Preliminaries 

In this paper, we consider acyclic deterministic finite automata (ADFAs). The 
algorithm is readily extended to work with acyclic deterministic transducers, 
though such an extension is not considered here. 

A deterministic finite automaton (DFA) is a 5-tuple $(Q, \Gamma, \delta, q_0, F)$ where: 

— $Q$ is the finite set of states. 

— $\Gamma$ is the input alphabet. We choose this instead of the more traditional $\Sigma$ 
since we will use that letter for ‘summation’ in the section on running time 
analysis. 

— $\delta \in Q \times \Gamma \to Q \cup \{\bot\}$ is the transition function. It is a partial function, 
and we use $\bot$ to designate the invalid state. 

— $q_0 \in Q$ is the start state. 

— $F \subseteq Q$ is the set of final states. 

To make some definitions simpler, we will use the shorthand $\Gamma_q$ to refer to the 
set of all alphabet symbols which appear as out-transition labels from state $q$. 
Formally, 

$$\Gamma_q = \{\, a \mid a \in \Gamma \land \delta(q, a) \neq \bot \,\}$$ 

All of the algorithms presented in this paper are in the form of the guarded 
command language, a type of pseudo-code; see [?]. 
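
As a concrete illustration of these definitions, the following Python sketch stores an ADFA with a partial transition function. The names `State`, `delta`, `out_labels` and `accepts` are assumptions made for this example and do not come from the paper; `None` stands in for the invalid state.

```python
class State:
    """One state of an ADFA; `trans` holds the partial transition function."""

    def __init__(self, final=False):
        self.trans = {}      # symbol -> State; a missing key plays the role of the invalid state
        self.final = final   # True iff the state belongs to F


def delta(q, a):
    """The transition function, returning None for the invalid state."""
    return q.trans.get(a)


def out_labels(q):
    """The set Gamma_q of symbols that label out-transitions of q."""
    return set(q.trans)


def accepts(q0, word):
    """Follow transitions from the start state q0 and test finality at the end."""
    q = q0
    for a in word:
        q = delta(q, a)
        if q is None:
            return False
    return q.final


# A three-state ADFA for the finite language {"a", "ab"}.
q0, q1, q2 = State(), State(final=True), State(final=True)
q0.trans["a"] = q1
q1.trans["b"] = q2
```

Here `out_labels(q0)` contains only the symbol `a`, and the automaton accepts exactly the two words of the language.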

3 Minimizing an ADFA 

In this section, we derive a procedure for minimizing an ADFA. We begin with 
an abstract algorithm (whose correctness is easily determined) and refine the 
abstract details to yield an efficient algorithm. 

The primary definition of minimality of an ADFA $M$ (indeed, this definition 
applies to any DFA, not just acyclic ones) is: 

(for all $M'$ such that $M'$ is equivalent to $M$, $|M| \le |M'|$ holds) 

where equivalence of DFAs means that they accept the same language. This 
definition is difficult to manipulate (in deriving an algorithm), and so we consider 
one written in terms of the right languages of states. The right language of a 
state $q$, written $\vec{\mathcal{L}}(q)$, is the set of all words spelled out on paths from $q$ to a 
final state. Using right languages (and the Myhill-Nerode theorem), minimality 
can also be written as the following predicate (which we call postcondition $R$): 

(for all $p, q \in Q$ such that $p \neq q$, $\lnot Equiv(p, q)$ holds) 

where we define predicate $Equiv$ to be equivalence of states: 

$$Equiv(p, q) \equiv \vec{\mathcal{L}}(p) = \vec{\mathcal{L}}(q)$$ 

(Additionally, we require that there are no useless states, though this additional 
restriction is not usually written and we ignore it in the rest of this paper since 
our algorithms have no way of creating useless states.) 

To achieve postcondition $R$, we introduce a procedure minimize which: 

— assumes the ADFA as a global data-structure; 

— takes (as first parameter) a set of states $U$, called the unique states, which 
are pairwise inequivalent; that is: 

(for all $p, q \in U$ such that $p \neq q$, $\lnot Equiv(p, q)$ holds) 

— does not shrink the set $U$ of pairwise inequivalent states; 

— takes (as second parameter) another set of states $V$ (which is disjoint from 
$U$, i.e. $U \cap V = \emptyset$) which are to be made pairwise unique; those which are 
not unique will be removed since they are redundant. 

To be concise, we define part of our invariant, $I_1$, as the predicate: 

(for all $p, q \in U$ such that $p \neq q$, $\lnot Equiv(p, q)$ holds) 

In the following presentation of the algorithm, we do not give all of the shadow 
variables or the complete (and lengthy) invariants for a full correctness argument: 



proc minimize(U, V) 
    {invariant: $U \cap V = \emptyset \land I_1$} 
    {variant: $|V|$} 
    while $V \neq \emptyset$ do 
        let $q : q \in V$ 
        $V := V - \{q\}$ 
        if there exists $p$ such that $p \in U \land Equiv(p, q)$ then 
            {q is redundant} 
            let $p : p \in U \land Equiv(p, q)$ 
            redirect all of q’s in-transitions to p 
            remove state q, since it’s redundant 
        else 
            {q is unique} 
            $U := U \cup \{q\}$ 
        end if 
    end while 
    {$V = \emptyset$} 



Note that the invariant is also a precondition of the entire procedure. Invoking 
the procedure as minimize($\emptyset$, $Q$) would clearly minimize the entire ADFA. 

The main difficulty with this algorithm is testing $Equiv$. This algorithm would 
actually be applicable to DFAs with cycles, if we found a way of practically 
implementing the guard; see [?, Chapter 7] for a wide variety of algorithms 
(including an incremental one) that are usable on DFAs with cycles. Fortunately, 
we can use an inductive definition of $\vec{\mathcal{L}}$: 

$$\vec{\mathcal{L}}(q) = \Bigl(\bigcup_{a \in \Gamma_q} \{a\}\,\vec{\mathcal{L}}(\delta(q, a))\Bigr) \cup \begin{cases} \{\varepsilon\} & \text{if } q \in F \\ \emptyset & \text{otherwise} \end{cases}$$ 

Intuitively, a word $z$ is in $\vec{\mathcal{L}}(q)$ if and only if 

— $z$ is of the form $az'$ where $a \in \Gamma$ is a label of an out-transition from $q$ to 
$\delta(q, a)$ (i.e. $a \in \Gamma_q$) and $z'$ is in the right language of $\delta(q, a)$, or 

— $z = \varepsilon$ and $q$ is a final state. 

We can begin rewriting $Equiv$ as follows (where each line of the rewriting is 
separated by a hint corresponding to the rewriting step): 

    $Equiv(p, q)$ 
$\equiv$  { definition of $Equiv$ } 
    $\vec{\mathcal{L}}(p) = \vec{\mathcal{L}}(q)$ 
$\equiv$  { the inductive definition of $\vec{\mathcal{L}}$ } 
    $(\varepsilon \in \vec{\mathcal{L}}(p) \equiv \varepsilon \in \vec{\mathcal{L}}(q)) \land \Gamma_p = \Gamma_q \land$ 
    (for all $a$ such that $a \in \Gamma_p \cap \Gamma_q$, $\{a\}\vec{\mathcal{L}}(\delta(p, a)) = \{a\}\vec{\mathcal{L}}(\delta(q, a))$ holds) 
$\equiv$  { by the inductive definition, $\varepsilon \in \vec{\mathcal{L}}(p) \equiv p \in F$ } 
    $(p \in F \equiv q \in F) \land \Gamma_p = \Gamma_q \land$ 
    (for all $a$ such that $a \in \Gamma_p \cap \Gamma_q$, $\{a\}\vec{\mathcal{L}}(\delta(p, a)) = \{a\}\vec{\mathcal{L}}(\delta(q, a))$ holds) 
$\equiv$  { for two languages $L_0, L_1$: $(\{a\}L_0 = \{a\}L_1) \equiv (L_0 = L_1)$ } 
    $(p \in F \equiv q \in F) \land \Gamma_p = \Gamma_q \land$ 
    (for all $a$ such that $a \in \Gamma_p \cap \Gamma_q$, $\vec{\mathcal{L}}(\delta(p, a)) = \vec{\mathcal{L}}(\delta(q, a))$ holds) 
$\equiv$  { definition of $Equiv$ } 
    $(p \in F \equiv q \in F) \land \Gamma_p = \Gamma_q \land$ 
    (for all $a$ such that $a \in \Gamma_p \cap \Gamma_q$, $Equiv(\delta(p, a), \delta(q, a))$ holds) 

Clearly, the last step has yielded a recursive definition which, while 
implementable (since we have acyclic DFAs and the recursion will end), is not 
very efficient (interestingly, this would lead to the algorithm in [?, Section 7.4.6]). 
Fortunately, the acyclicity also yields a way to implement this efficiently by 
restricting the invariant in the algorithm (as in the following paragraphs). 

The evaluation of $Equiv(p, q)$ would be much simpler if we were assured that 
all $\delta(p, a)$ and $\delta(q, a)$ were already in our unique set $U$ (i.e. the children $\delta(p, a)$ 
and $\delta(q, a)$ are pairwise unique). Since we will use this requirement extensively, 
we write it as predicate $P_1(r)$ (for state $r$): 

(for all $a$ such that $a \in \Gamma_r$, $\delta(r, a) \in U$ holds) 
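
The recursive characterisation of Equiv obtained in this section can be sketched directly in Python. This is only an illustration under assumed conventions: the automaton is given as a dictionary `delta` from states to their out-transitions and a set `finals`, neither of which is notation from the paper.

```python
def equiv(delta, finals, p, q):
    """Recursive state equivalence for an acyclic automaton.

    delta  : dict mapping each state to {symbol: successor} (partial transition function)
    finals : set of final states
    """
    if (p in finals) != (q in finals):        # finality of p and q must agree
        return False
    if delta[p].keys() != delta[q].keys():    # the out-transition labels must agree
        return False
    # Acyclicity guarantees that the recursion terminates.
    return all(equiv(delta, finals, delta[p][a], delta[q][a]) for a in delta[p])


# Trie for {"ab", "cb"}: states 1 and 3 both have the right language {"b"}.
delta = {0: {"a": 1, "c": 3}, 1: {"b": 2}, 2: {}, 3: {"b": 4}, 4: {}}
finals = {2, 4}
```

Each call may revisit the same pairs of descendants many times, which is the inefficiency the restricted invariant is meant to remove.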
Ideally, we could assume $P_1(p) \land P_1(q)$, in which case we could rewrite 

    $P_1(p) \land P_1(q) \land Equiv(\delta(p, a), \delta(q, a))$ 
$\equiv$  { definition of $Equiv$ } 
    $P_1(p) \land P_1(q) \land \vec{\mathcal{L}}(\delta(p, a)) = \vec{\mathcal{L}}(\delta(q, a))$ 
$\equiv$  { property of $U$ stated in invariant $I_1$ and assumption $P_1(p) \land P_1(q)$ } 
    $P_1(p) \land P_1(q) \land \delta(p, a) = \delta(q, a)$ 

Here, by introducing the assumption of $P_1(p) \land P_1(q)$, we have successfully 
removed the recursion in $Equiv$. Of course, it remains to ensure that this 
assumption holds. We consider the two assumed conjuncts separately. 

To ensure that $P_1(q)$ holds, we introduce an invariant $I_2$ (which also happens to 
be a precondition) which states that: 

(for all $r, a$ such that $r \in V \land a \in \Gamma_r$, $\delta(r, a) \in V$ or $\delta(r, a) \in U$ holds) 

(Note that $U \cup V$ is not necessarily equal to $Q$, so this is not as simple as it 
looks.) A predicate such as $I_2$ is trivially true if $r$ has no out-transitions or there 
is no $r \in V$ (i.e. $V = \emptyset$). Given $I_2$ and the acyclic property of ADFAs, we know 
that for any $r \in V$, there will be a chain of $r$’s children eventually ending with 
one in $U$. From this, we conclude that we can always choose some $q \in V$ 
such that $P_1(q)$. 

Turning to $P_1(p)$, we introduce another invariant $I_3$: 

(for all $r$ such that $r \in U$, $P_1(r)$ holds) 

We can now summarize our derivation from the last two derivations above: 

    $P_1(p) \land P_1(q) \land (p \in F \equiv q \in F) \land \Gamma_p = \Gamma_q \land$ 
    (for all $a$ such that $a \in \Gamma_p \cap \Gamma_q$, $Equiv(\delta(p, a), \delta(q, a))$ holds) 
$\equiv$  { invariants $I_2$ and $I_3$, assumption $P_1(p) \land P_1(q)$, previous derivation } 
    $P_1(p) \land P_1(q) \land (p \in F \equiv q \in F) \land \Gamma_p = \Gamma_q \land$ 
    (for all $a$ such that $a \in \Gamma_p \cap \Gamma_q$, $\delta(p, a) = \delta(q, a)$ holds) 

We label the latter three conjuncts (i.e. excluding the assumption) of the last 
predicate $P_2(p, q)$. Given this, we can rewrite our procedure: 

proc minimize(U, V) 
    {invariant: $U \cap V = \emptyset \land I_1 \land I_2 \land I_3$} 
    {variant: $|V|$} 
    while $V \neq \emptyset$ do 
        let $q : q \in V \land P_1(q)$ 
        $V := V - \{q\}$ 
        if there exists $p$ such that $p \in U \land P_2(p, q)$ then 
            {q is redundant} 
            let $p : p \in U \land P_2(p, q)$ 
            redirect all of q’s in-transitions to p 
            remove state q, since it’s redundant 
        else 
            {q is unique} 
            $U := U \cup \{q\}$ 
        end if 
    end while 
    {$V = \emptyset$} 

This algorithm is still invoked as minimize($\emptyset$, $Q$). 

Because of $I_2$ and the $P_1(q)$ assumption, this algorithm operates in a bottom-up 
fashion (i.e. from final states towards the start state $q_0$, if we imagine $q_0$ at 
the top of the page, with the final states towards the bottom) on the set $V$, 
where (by precondition $I_2$) $U$ is ‘hanging’ below $V$. 

Instead of $V$ being a set, we can reinforce this ‘bottom-up’ order by stipulating 
that $V$ is a stack with those items on the top of the stack being topologically 
‘lower’ (closer to $U$, and thus further from $q_0$) than the items lower in the stack; 
i.e. if $r$ is on top of the stack, then $P_1(r)$. 

Before restructuring the algorithm to use a stack, we rewrite $V$ as a stack in 
invariant $I_2$ ($I_3$ remains the same). The revised $I_2$ states that: 

(for all $r, a$ such that $r \in V \land a \in \Gamma_r$, ($\delta(r, a)$ is higher in $V$, 
or $\delta(r, a) \in U$) holds) 

This gives the algorithm (where [] is the empty stack): 

proc minimize(U, V) 
    {invariant: $U \cap V = \emptyset \land I_1 \land I_2 \land I_3$} 
    {variant: $|V|$} 
    while $V \neq [\,]$ do 
        $q := pop(V)$ 
        {$P_1(q)$} 
        if there exists $p$ such that $p \in U \land P_2(p, q)$ then 
            {q is redundant} 
            let $p : p \in U \land P_2(p, q)$ 
            {$Equiv(p, q)$} 
            redirect all of q’s in-transitions to p 
            remove state q, since it’s redundant 
        else 
            {q is unique} 
            $U := U \cup \{q\}$ 
        end if 
    end while 
    {$V = [\,]$} 

To minimize the entire ADFA, we invoke it with minimize($\emptyset$, $V$), where $V$ 
is a preorder stacking of the states $Q$ (any preorder will do, since the 
out-transitions from any given state are unordered). 

4 Adding Words to an ADFA 

At first glance, in adding a word $w$ to an ADFA we would simply trace out the 
path (corresponding to $w$), from the start state $q_0$, and make the resulting state 
a final one (add it to $F$). Unfortunately, this may inadvertently add more than 
one word to the ADFA in the event that some state on the path has more than 
one in-transition. 

Under certain conditions, we can, however, use such a simple procedure. A 
sufficient condition is that, along the $w$ path there must be no state $q$ with more 
than one in-transition. We can state this condition formally as $P_3(w)$: 

(for all $i$ such that $0 \le i \le |w|$, $pred(\delta^*(q_0, w_{[0..i)})) \le 1$ holds) 

where $pred(q)$ is the number of predecessors (in-transitions) of state $q$ and 
$\delta^*$ is the usual reflexive and transitive closure of the transition function $\delta$. 

Condition $P_3$ is trivially guaranteed if we are adding a word to a trie, that is, a 
tree-structured ADFA. We state condition $P_3$ because we will be simultaneously 
minimizing the ADFA, and so we may have a situation where not all states will 
have at most one predecessor. 
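
To see why the condition on in-transitions matters when a word is added by tracing its path, the following Python sketch checks it on a small automaton. The dictionary representation and the names `in_degrees` and `satisfies_p3` are assumptions of this example, not the paper's notation; the check covers only the part of the path that already exists.

```python
from collections import Counter


def in_degrees(delta):
    """Number of in-transitions of every state (zero for states not listed)."""
    return Counter(t for out in delta.values() for t in out.values())


def satisfies_p3(delta, q0, word):
    """True iff no existing state on the path of `word` has more than one in-transition."""
    pred = in_degrees(delta)
    q = q0
    if pred[q] > 1:
        return False
    for a in word:
        q = delta[q].get(a)
        if q is None:        # the rest of the path does not exist yet
            return True
        if pred[q] > 1:
            return False
    return True


# States 1 and 2 share the successor 3, so state 3 has two in-transitions.
delta = {0: {"a": 1, "b": 2}, 1: {"c": 3}, 2: {"c": 3}, 3: {}}
```

In this automaton, extending the path of "acd" through the shared state 3 would also introduce "bcd", which is exactly the situation the condition rules out; "ad" leaves the shared part untouched and is safe.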



We can now give the procedure add_word which takes the word, adds it to 
the ADFA (which is taken to be a global data-structure) and returns the state 
at which the word ends (cand is conditional conjunction): 

func add_word(w) : Q 
    {$P_3(w)$} 
    $q, i := q_0, 0$ 
    {invariant: $q = \delta^*(q_0, w_{[0..i)})$} 
    {variant: $|w| - i$} 
    while $i < |w|$ cand $\delta(q, w_i) \neq \bot$ do 
        $q, i := \delta(q, w_i), i + 1$ 
    end while 
    if $i < |w|$ then 
        {$\delta(q, w_i) = \bot$} 
        while $i < |w|$ do 
            $q, i := build\_state(q, w_i), i + 1$ 
        end while 
    else 
        skip 
    end if 
    $F := F \cup \{q\}$ 
    return q 

(build_state is a function for extending the ADFA with a new transition. It 
returns the newly created state.) The first loop proceeds as far as possible in 
the existing ADFA, while the second loop extends the ADFA, as necessary, to 
accommodate the rest of $w$. 

5 Combining the Algorithms 

If we are simultaneously minimizing and adding words, we must co-ordinate the 
two algorithms. Recall, from the last version of the minimizing algorithm, that 
only states in $U$ have been minimized (and may, therefore, have more than one 
predecessor) and that set $U$ is grown bottom-up (towards $q_0$). To synchronize 
the algorithms, it makes sense to add the words of the word-set $W$ such that 
their final states are also added bottom-up. 

One possible way to order $W$ is to add the words in any order of decreasing 
length. We say any, since there are usually several such possible orderings. 
In this case, while adding $w$, we will not pass through any final states, and it 
will be safe to minimize the portion of the ADFA below the top-most (closest 
to $q_0$) final states. We call these top-most final states the upper final states 
frontier since they form the set of final states which appear first over all paths 
leading away from $q_0$. The states below this frontier will become our minimization 
set $U$. 

After invoking add_word($w$), the states at and below the returned state 
(inclusive) are safe for minimization and can be added to the stack (in preorder) 
using this procedure (which initially takes the returned value of add_word and 
the empty stack): 



proc build_stack(q, X) : Stack of states 
    push(X, q) 
    for $a : a \in \Gamma_q$ do 
        if $\delta(q, a) \in F$ then 
            skip 
        else 
            $X := build\_stack(\delta(q, a), X)$ 
        end if 
    end for 
    return X 



Along each q-to-leaf path, this procedure stops when it encounters another 
final state (which, by our invariant, will already be in U). 

After adding word $w$, we minimize the following stack: 

build_stack(add_word($w$), []) 

After adding all of the words, we still need to minimize the states above the upper 
final states frontier. This is done with an additional invocation of build_stack 
(starting from $q_0$) and minimize. 

This yields our final, combined, semi-incremental algorithm: 

{we have an empty ADFA with a single start state} 
$U := \emptyset$ 
{invariant: $U$ is the upper final states frontier and below} 
{variant: $|W|$} 
while $W \neq \emptyset$ do 
    let $w : w \in W \land w$ is any longest word in $W$ 
    $W := W - \{w\}$ 
    minimize($U$, build_stack(add_word($w$), [])) 
end while 
{all have been minimized except those above the final states frontier} 
minimize($U$, build_stack($q_0$, [])) 
{$R$} 
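
The combined algorithm can be sketched in Python as follows. This is a minimal illustration rather than the author's implementation, and it makes two assumptions that are not in the paper: the set $U$ is realised as a dictionary (`register`) keyed by a state's finality and out-transitions, which turns the test for an equivalent state into a hash lookup, and each state records its single predecessor (`parent`) so that redirecting in-transitions is one assignment. All identifiers are hypothetical.

```python
from itertools import count

_ids = count()


class State:
    def __init__(self, parent=None):
        self.uid = next(_ids)   # stable identity used in register keys
        self.trans = {}         # partial transition function: symbol -> State
        self.final = False
        self.parent = parent    # (predecessor, symbol); unique while unminimized


def add_word(q0, w):
    """Trace w from q0, extend the automaton as needed, mark the end state final."""
    q, i = q0, 0
    while i < len(w) and w[i] in q.trans:
        q, i = q.trans[w[i]], i + 1
    while i < len(w):                       # plays the role of build_state
        new = State(parent=(q, w[i]))
        q.trans[w[i]] = new
        q, i = new, i + 1
    q.final = True
    return q


def build_stack(q, stack):
    """Push q and the non-final states below it in preorder, stopping at final states."""
    stack.append(q)
    for a in sorted(q.trans):
        if not q.trans[a].final:
            build_stack(q.trans[a], stack)
    return stack


def minimize(register, stack):
    """Pop states bottom-up; merge a state into a registered equivalent one."""
    while stack:
        q = stack.pop()
        key = (q.final, tuple(sorted((a, t.uid) for a, t in q.trans.items())))
        p = register.get(key)
        if p is None:
            register[key] = q               # q is unique: add it to U
        elif p is not q and q.parent is not None:
            pred, a = q.parent              # q is redundant: redirect its in-transition
            pred.trans[a] = p


def build_minimal_adfa(words):
    q0, register = State(), {}
    for w in sorted(set(words), key=len, reverse=True):  # any order of decreasing length
        minimize(register, build_stack(add_word(q0, w), []))
    minimize(register, build_stack(q0, []))              # states above the frontier
    return q0


def language(q, prefix=""):
    """All words accepted from state q (finite because the automaton is acyclic)."""
    found = {prefix} if q.final else set()
    for a, t in q.trans.items():
        found |= language(t, prefix + a)
    return found


def reachable_states(q0):
    """Number of states reachable from q0."""
    seen, todo = {q0.uid}, [q0]
    while todo:
        for t in todo.pop().trans.values():
            if t.uid not in seen:
                seen.add(t.uid)
                todo.append(t)
    return len(seen)


WORDS = ["tap", "taps", "top", "tops", "cat", "cats"]
adfa = build_minimal_adfa(WORDS)
```

For these six words the result accepts exactly the input set and has seven reachable states, one per distinct right language.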



6 Implementation and Performance 

We begin with an analysis of the running time of the final algorithm. We have 
the following sub-analyses: 

— The outer repetition executes exactly $|W|$ times. We will actually ignore this 
and separately calculate the total running time of the function invocations. 

— For any word $w$, add_word($w$) adds $O(|w|)$ states and takes the same order 
time. 

— Each state is pushed onto the stack exactly once (thereafter it is placed in $U$ 
or removed due to redundancy, and by acyclicity never appears in the stack 
again). 

— An invocation minimize($U$, $X$) takes $O(|\Gamma| \cdot |X|)$ time. The factor $|\Gamma|$ is due 
to the test $P_2(p, q)$, while $|X|$ is due to the outer loop of that procedure. 

Here, we have assumed that some elementary operations (such as set membership 
in $U$, stack operations and automata transitions) can be done in constant time. 
The total time taken by add_word (and also the total number of states) is 

$$O\Bigl(\sum_{w \in W} |w|\Bigr)$$ 

This also happens to be the time taken (in total) by build_stack and the order 
of the total stack size. 

Given that each state is pushed onto the stack exactly once, the total time 
taken by minimize is 

$$O\Bigl(|\Gamma| \cdot \sum_{w \in W} |w|\Bigr)$$ 

It follows that the total running time of this algorithm is: 

$$O\Bigl(|\Gamma| \cdot \sum_{w \in W} |w| + \sum_{w \in W} |w| + \sum_{w \in W} |w|\Bigr)$$ 

or, equivalently 

$$O\Bigl(|\Gamma| \cdot \sum_{w \in W} |w|\Bigr)$$ 

Interestingly, this running time is asymptotically the same as the running time 
of the best known non-incremental algorithms for ADFAs (they are also linear 
in the size of the ADFA, whose construction is in turn linear in $\sum_{w \in W} |w|$). 
Naturally, the semi-incrementality takes its toll in the constant factor. 

In practical terms, for performance reasons, we can do the following: 

— U can be made global. 

— U can be indexed by a state’s finality. 

Complete performance data for this algorithm is not yet available. Preliminary 
benchmarking shows, however, that it is substantially faster (sometimes by a 
factor of 2.3) than Ribbit Software Systems Inc.’s implementations of the two 
algorithms presented in [?]. Those algorithms maintain full minimality (thereby 
incurring the additional cost). Benchmarking data used to date (with the 
algorithm in this paper) have shown that the ADFA remains within 27% of minimal 
(in terms of state count). 

7 Conclusions 

We have presented a simple semi-incremental algorithm for acyclic DFAs — 
indeed, it appears to be the first such algorithm published. The following aspects 
of the algorithm are noteworthy: 

— The algorithm is particularly simple and easily understood. 

— The simplicity of the algorithm goes hand-in-hand with the formal derivation 
and presentation of correctness arguments. 

— In order to formally derive a (semi-)incremental algorithm, novel techniques 
were used, such as: separately developing component algorithms that are 
nondeterministic and using their nondeterminism to synchronize them (co- 
operatively) into the incremental one. 

— The running time of the algorithm is asymptotically as good as the best 
non-incremental ones. 

Acknowledgements: Nanette Saes and Richard Watson were kind enough to 
proofread this paper. The referees also provided a great number of useful com- 
ments, and Jan Daciuk’s recent work inspired me to write up this algorithm. 

Abstract. We wish to use a given nondeterministic two-way multi-tape 
acceptor as a transducer by supplying the contents for only some of its 
input tapes, and asking it to generate the missing contents for the other 
tapes. We provide here an algorithm for determining beforehand whether 
this transduction always results in a finite set of answers or not. We also 
develop an algorithm for evaluating these finite answers whenever the 
previous algorithm indicated their existence. Our algorithms can also be 
used for speeding up the simulation of these acceptors even when not 
used as transducers. 



1 Introduction 

In this paper we study the following problem: assume that we are given a non- 
deterministic two-way multi-tape acceptor A and a subset X of its tapes. We 
would like to use A no longer as an acceptor, which receives input on all its 
tapes, but as a kind of transducer [3 Chapter 2.7] instead, which receives input 
on tapes X only, and generates as output the set of missing inputs onto the 
other tapes. We then face the following two problems: 

1. Can it be guaranteed that given any choice of input strings for tapes X the 
set of corresponding outputs of A will always remain finite? 

2. Even if problem 1 could be solved positively, how can the actual set of 
outputs corresponding to a given choice of input strings be computed? 

Our motivation for studying these problems arises from string databases 
[?], which manipulate strings instead of indivisible atomic entities. Such 
databases are of interest for example in bioinformatics, because they allow the 
direct representation and manipulation of the stored nucleotide (DNA or RNA) 
sequences. 

If we assume an SQL-like notation [?, Chapter 7.1] for the query language, 
then one possible query for such a string database might be stated as follows. 

SELECT $x_2$ 
FROM $x_1$ IN $R$ 
WHERE $\phi_{rev}(x_1, x_2)$ 

Here $\phi_{rev}(x_1, x_2)$ is some user-defined expression which compares the strings $w_1$ 
and $w_2$ denoted by the variables $x_1$ and $x_2$, say “$w_2$ is the reversal of $w_1$”. Then 
this query requests every string $w_2$ that is the reversal of some string $w_1$ currently 
stored in the database table $R$. Note in particular that these strings $w_2$ need not 
(and in general cannot) be stored anywhere in the database; the query evaluation 
machinery must generate them instead as needed. 

We have developed elsewhere [?] a logical framework for such a query 
language. This framework accommodates expressions like $\phi_{rev}(x_1, x_2)$ via a 
multidimensional extension of the modal Extended Temporal Logic suggested by 
Wolper [?]. The multi-tape acceptors studied here are exactly the 
computational counterparts to these logical expressions. 

A given query to a database is considered to be safe for execution if there is 
a way to evaluate its answer finitely. One safe plan for evaluating the 
aforementioned query would be as follows, where $L(A_{rev})$ is the string relation 
accepted by $A_{rev}$, a multi-tape acceptor corresponding to the expression 
$\phi_{rev}(x_1, x_2)$. (One such acceptor is shown as Fig. 1 below.) 

for all strings $w_1$ in table $R$ do 
    $V \leftarrow \{w_2 : (w_1, w_2) \in L(A_{rev})\}$; 
    output every string in $V$ 
end for 

Our two problems stem from these safe evaluation plans. Problem 1 is “How 
could we infer from $\phi_{rev}(x_1, x_2)$ that the set $V$ is always going to be finite for 
every string $w_1$ that can appear in $R$?” Problem 2 is in turn “Even if this is so, 
how can we simulate this $A_{rev}$ (efficiently) for each $w_1$ to generate these $V$?” 
(We have studied elsewhere [?, Chapter 4.4] how solutions to problem 1 guide 
the selection of safe execution plans.) 

One possible solution would have been to restrict the language for the 
expressions $\phi_{rev}(x_1, x_2)$ beforehand into one which ensures this finiteness by 
definition, say by fixing $x_1$ to be the input variable, which is mapped into the 
output variable $x_2$ as a kind of transduction [?]. However, in logic-based data 
models [?], the use of transducers seems less natural than acceptors, because the 
concept of information flow from input to output is alien to the logical level, and 
of interest only in the query evaluation level. But we must eventually also evaluate 
our string database queries, and then we must infer which of our acceptors can be 
used as transducers, and how to perform these inferred transductions, and thus 
we face the aforementioned problems. 

The rest of this paper is organized as follows. Sect. 1.1 presents the acceptors 
we wish to use as transducers. Sect. 1.2 formalizes problem 1 and reviews what 
is known about its decidability. We then present our algorithm, which gives a 
sufficient condition for answering problem 1 in the affirmative, followed by an 
explicit evaluation method for those acceptors that this algorithm accepts, 
answering problem 2 in turn. Finally, we conclude our presentation. 

1.1 The Automaton Model 

Let the alphabet $\Sigma$ be a finite set of characters. We also assume left and right 
tape end-markers ‘[’ and ‘]’ not in $\Sigma$. Then we define the $n$th character of a 
given string $w \in \Sigma^*$ with length $|w| = m$ as 

$$w[n] = \begin{cases} [ & \text{if } n = 0, \\ ] & \text{if } n = m + 1, \\ \text{the $n$th element of $w$} & \text{if } 1 \le n \le m. \end{cases}$$ 

Intuitively our automaton model is a “two-way multi-tape nondeterministic 
finite state automaton with end-markers”; similar devices have been studied 
by for example Harrison and Ibarra [?] and Rajlich [?]. Formally, a $k$-tape 
Finite State Automaton ($k$-FSA) [?, Section 3], [?, Chapter 3.1] is a tuple 
$A = (Q_A, s_A, F_A, T_A)$ with four elements, where (i) $Q_A$ is a finite set of states; 
(ii) $s_A \in Q_A$ is a distinguished start state; (iii) $F_A \subseteq Q_A$ is the set of final 
states; and (iv) $T_A$ is a set of transitions of the form 
$p \xrightarrow[d_1,\ldots,d_k]{c_1,\ldots,c_k} q$, where $p, q \in Q_A$, each 
$c_i \in \Sigma \cup \{[, ]\}$, and each $d_i \in \{-1, 0, +1\}$. We moreover require that $d_i = -1$ 
implies $c_i \neq [$ and $d_i = +1$ implies $c_i \neq ]$; this ensures that the heads do indeed 
stay within the tape area limited by these end-markers. 

A configuration of $A$ on input $w = (w_1, \ldots, w_k) \in (\Sigma^*)^k$ is of the form 
$C = (p, n_1, \ldots, n_k)$, where $p \in Q_A$ and $0 \le n_i \le |w_i| + 1$ for all $1 \le i \le k$. 
This $C$ corresponds intuitively to the situation where $A$ is in state $p$, and each 
head $i = 1, \ldots, k$ is scanning square number $n_i$ of the tape containing string 
$w_i$. Hence we say that $(q, n_1 + d_1, \ldots, n_k + d_k)$ is a possible next configuration 
of $C$ if and only if $p \xrightarrow[d_1,\ldots,d_k]{c_1,\ldots,c_k} q \in T_A$, where each $c_i$ is 
the $n_i$th character of $w_i$. Now $+1$ can be interpreted as 
“read forward”, while $-1$ means “rewind the tape to read the preceding square 
again”, and $0$ “stand still”. We call tape $i$ of $A$ unidirectional if no transition in 
$T_A$ specifies direction $-1$ for it; otherwise tape $i$ is called bidirectional instead. 

A computation of $A$ on input $w$ is a sequence $\mathcal{C} = C_0 C_1 C_2 \ldots$ of these 
configurations, which starts with the initial configuration $C_0 = (s_A, 0, \ldots, 0)$, 
and each $C_{j+1}$ is a possible next configuration of the preceding configuration $C_j$. 
This computation $\mathcal{C}$ is accepting if and only if it is finite, its last configuration 
$C_f$ has no possible next configurations, and the state of this $C_f$ belongs to $F_A$. 
The language $L(A)$ accepted by $A$ consists of those inputs $w$ for which there 
exists an accepting computation $\mathcal{C}$. 
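
Since an input admits only finitely many configurations, acceptance can be decided by searching the configuration graph. The Python sketch below does this breadth-first. The encoding of transitions as a dictionary and the function names are assumptions of this example, and `reversal_acceptor` is a hypothetical 2-FSA for string reversal written for illustration; it is not claimed to match the automaton of Fig. 1.

```python
from collections import deque

LEFT, RIGHT = "[", "]"


def char_at(w, n):
    """The n-th square of the tape holding w, with end-markers at 0 and len(w) + 1."""
    if n == 0:
        return LEFT
    if n == len(w) + 1:
        return RIGHT
    return w[n - 1]


def accepts(transitions, start, finals, inputs):
    """Breadth-first search over configurations (state, n_1, ..., n_k).

    transitions maps (state, (c_1, ..., c_k)) to a list of
    ((d_1, ..., d_k), next_state) pairs.
    """
    initial = (start,) + (0,) * len(inputs)
    seen, queue = {initial}, deque([initial])
    while queue:
        state, *heads = queue.popleft()
        chars = tuple(char_at(w, n) for w, n in zip(inputs, heads))
        successors = transitions.get((state, chars), [])
        if not successors and state in finals:
            return True      # a halting configuration in a final state
        for moves, nxt in successors:
            config = (nxt,) + tuple(n + d for n, d in zip(heads, moves))
            if config not in seen:
                seen.add(config)
                queue.append(config)
    return False


def reversal_acceptor(alphabet):
    """Hypothetical 2-FSA accepting the pairs (u, v) where v is the reversal of u."""
    t = {
        ("I", (LEFT, LEFT)): [((+1, 0), "II")],
        ("II", (RIGHT, LEFT)): [((-1, +1), "III")],
        ("III", (LEFT, RIGHT)): [((0, 0), "F")],
    }
    for a in alphabet:
        t[("II", (a, LEFT))] = [((+1, 0), "II")]    # find the right end of tape 1
        t[("III", (a, a))] = [((-1, +1), "III")]    # compare in opposite directions
    return t


A_REV = reversal_acceptor("ab")
```

A call such as `accepts(A_REV, "I", {"F"}, ("ab", "ba"))` explores at most one configuration per combination of state and head positions, so the search always terminates.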

Because $A$ is nondeterministic, we can without loss of generality assume that 
no transitions leave the final states $F_A$. We can for example introduce a new 
state $f_A$ into $Q_A$, and set $F_A = \{f_A\}$. Then for every state $p$ previously in $F_A$, 
and every character combination $c_1, \ldots, c_k \in \Sigma \cup \{[, ]\}$ on which there is no 
transition leaving $p$, we add the transition 
$p \xrightarrow[0,\ldots,0]{c_1,\ldots,c_k} f_A$. In this way, whenever a 
computation of $A$ would halt in state $p$, it performs instead one extra transition 
into the (now unique) new final state $f_A$, and halts there. 

We often view $A$ as a transition graph $G_A$ with nodes $Q_A$ and edges $T_A$. In 
particular, a computation of $A$ can be considered to trace a path $\mathcal{P}$ within $G_A$ 
starting from node $s_A$. It is furthermore expedient to restrict attention to 
non-redundant $A$, where each state is either $s_A$ itself or on some path $\mathcal{P}$ from it 
into some state in $F_A$. Fig. 1 presents a 2-FSA $A_{rev}$ in this form, where $\Sigma = \{a, b\}$. 
$L(A_{rev})$ consists of the pairs $(u, v)$, where string $v$ is the reversal of string $u$: 
looping in state II finds the right end of the bidirectional tape 1 without moving 
the unidirectional tape 2, while looping in state III compares the contents of 
these two tapes in opposite directions. 



Fig. 1. A 2-FSA for Recognizing Strings and Their Reversals. 



Another often useful simplification is the following. 

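Definition 1 Tape $i$ of $k$-FSA $A$ is locally consistent if and only if every 
consecutive pair 

$$p \xrightarrow[d_1,\ldots,d_k]{c_1,\ldots,c_k} q \xrightarrow[d'_1,\ldots,d'_k]{c'_1,\ldots,c'_k} r$$ 

of transitions in $T_A$ satisfies the condition 

$$(d_i = 0 \Rightarrow c'_i = c_i) \land (d_i = +1 \Rightarrow c'_i \neq [) \land (d_i = -1 \Rightarrow c'_i \neq ]). \tag{1}$$ 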

This ensures that there are configurations in which this pair can indeed be taken; 
whether these configurations do ever occur in any computation is quite another 
matter. For example, both tapes in Fig. 1 are locally consistent. If in particular 
tape $i$ is both unidirectional and locally consistent, then given any path 

$$\mathcal{P} = s_A \xrightarrow[d_{(1,1)},\ldots,d_{(k,1)}]{c_{(1,1)},\ldots,c_{(k,1)}} p_1 \xrightarrow[d_{(1,2)},\ldots,d_{(k,2)}]{c_{(1,2)},\ldots,c_{(k,2)}} p_2 \xrightarrow[d_{(1,3)},\ldots,d_{(k,3)}]{c_{(1,3)},\ldots,c_{(k,3)}} p_3 \cdots p_m$$ 

in $G_A$ we can construct an input string $w_i = c'_1 c'_2 c'_3 \cdots c'_m$ for tape $i$, which 
allows $\mathcal{P}$ to be followed, if we choose 

$$c'_j = \begin{cases} c_{(i,j)} & \text{if } (d_{(i,j)} = +1 \lor j = m) \land (c_{(i,j)} \neq [), \text{ and} \\ \varepsilon & \text{otherwise.} \end{cases}$$ 

For example in Fig. 1, $w_2$ spells out the sequence of transitions taken when 
looping in state III. Harrison and Ibarra provide a related construction for deleting 
unidirectional input tapes from multi-tape pushdown automata [3, Theorem 2.2], 
while Rajlich [?, Definition 1.1] allows the reading head to scan two adjacent 
squares at the same time for similar purposes. 

Again the nondeterminism of $A$ allows us to enforce Definition 1 at the cost 
of expanding the size of $A$ by a factor of $(|\Sigma| + 2)$: construct a $k$-FSA $B$ with 
state space $Q_B = Q_A \times (\Sigma \cup \{[, ]\})$, which remembers the character under tape 
head $i$. Add for each transition $p \xrightarrow[d_1,\ldots,d_k]{c_1,\ldots,c_k} q$ the transitions 
$(p, c_i) \xrightarrow[d_1,\ldots,d_k]{c_1,\ldots,c_k} (q, c')$ 
into $T_B$ satisfying Eq. (1). Set finally $s_B = (s_A, [)$ and $F_B = F_A \times (\Sigma \cup \{[, ]\})$ 
to complete our construction. 

1.2 The Limitation Problem 

This section introduces our limitation problem |S1 Section 4]P3 Definition 3.3] 
concerning the automata defined in Sect. II . 1 1 

Definition 2 Given a $(k + l)$-FSA $A$, determine if there exists a limitation
function $W\colon \mathbb{N}^k \to \mathbb{N}$ with the following property: if $(u_1,\ldots,u_k,v_1,\ldots,v_l) \in L(A)$,
then $\max\{|v_j| : 1 \le j \le l\} \le W(|u_1|,\ldots,|u_k|)$.

If this is the case, then we say that $A$ satisfies the finiteness dependency
$\Gamma = \{1,\ldots,k\} \rightsquigarrow \{k+1,\ldots,k+l\}$. These dependencies are a special case of
functional dependencies in database theory [Section 8.2]. Intuitively, $A$ is a
finite representation of the conceptually infinite database table $L(A)$, while $\Gamma$
assures that if we select rows from this table by supplying values for the columns
$1,\ldots,k$, we always receive a finite answer. In this way $A$ can be safely used
as a string processing tool within our string database model.

In terms of automata theory we require that for any input $(u_1,\ldots,u_k) \in (\Sigma^*)^k$
the possible outputs $(v_1,\ldots,v_l) \in (\Sigma^*)^l$ must remain a finite set. This is
what is meant by “using acceptors as transducers”: we supply strings for only
some tapes (here $1,\ldots,k$) of the acceptor $A$, and ask it to produce all those
contents for the missing tapes (here $k+1,\ldots,k+l$) it would have accepted
given the known tape contents. The limitation problem is then to determine
beforehand whether this computation will always return a finite result or not.
Weber has studied a related question: whether the set of all possible
outputs on any inputs of a given transducer remains finite, and if so, what the
maximal output length is.

The hardness of the limitation problem has been shown to depend crucially on
the number of bidirectional tapes in $A$. The problem is undecidable for FSAs with
two bidirectional tapes [Chapter 4.1]: given a Turing machine [Chapter 7]
$M$ one can write a corresponding 3-FSA $A_M$ with two bidirectional tapes, which
accepts exactly the tuples $(u,v,w)$, where $v$ and $w$ together encode a sequence
of computation steps taken by $M$ on input $u$. Here $v$ and $w$ must be read
twice, requiring bidirectionality. Then asking whether $A_M$ satisfies $\{1\} \rightsquigarrow \{2,3\}$
amounts to asking whether $M$ is total.

The limitation problem becomes decidable if we restrict attention to those
FSAs with at most one bidirectional tape [Theorem 4.2] [Chapter 4.2].
Intuitively, all the unidirectional tapes are first made locally consistent, after
which Eq. (2) allows us to construct their contents at will, so that we can
concentrate on the sole bidirectional tape. This tape can in turn be studied by using
an extension of the well-known crossing sequence construction [Chapter 2.6]
for converting two-way finite automata into classical one-way finite automata.
This method is clearly impractical, however. Therefore this paper presents in
Sect. 2 a practical partial solution, which furthermore applies even in some cases
involving multiple bidirectional tapes.

Example 1. The 2-FSA $A_{rev}$ in Fig. 1 satisfies both $\{1\} \rightsquigarrow \{2\}$ and $\{2\} \rightsquigarrow \{1\}$
with the same limitation function $W(n) = n$, because the reversal of a string
is no longer than the string itself. This is moreover decidable, because only
tape 1 is bidirectional in $A_{rev}$. To see how limitation inference proceeds, consider
Fig. 2, which exhibits the crossing behavior of $A_{rev}$ when tape 1 contains the
string ab. For example, determining $\{2\} \rightsquigarrow \{1\}$ involves checking that every
character written onto the bidirectional output tape 1 is “paid for” by reading
something from the unidirectional input tape 2 as well, although this payment
may occur much later during the computation; here it occurs when tape 1 is
reread in reverse. This can in turn be seen from the automaton $B$ produced by
the crossing sequence construction by noting that the loops of $B$ around the
repeating crossing sequence indicated in Fig. 2 consume tape 2 as well.




Fig. 2. A Crossing Behavior of the 2-FSA in Fig. 1



The 2-FSA $A_{rev}$ is also considered to satisfy the trivial finiteness dependency
$\{1,2\} \rightsquigarrow \emptyset$ by definition. On the other hand, $A_{rev}$ does not satisfy $\emptyset \rightsquigarrow \{1,2\}$,
because $L(A_{rev})$ is not finite.

2 An Algorithm for Determining Limitation 

Our technique for solving the limitation problem given in Definition 2 is based on
the following two observations. Let $A$ be the $(k + l)$-FSA and $\Gamma = \{1,\ldots,k\} \rightsquigarrow \{k+1,\ldots,k+l\}$
the finiteness dependency in question.

1. If $A$ accepts some input $(w_1,\ldots,w_{k+l})$ with some computation $\mathcal{C}$, where
some head $k < j \le k + l$ never visits the corresponding right end-marker ‘]’,
then $A$ also accepts all the suffixed inputs

\[ (w_1,\ldots,w_{j-1}, w_j\Sigma^*, w_{j+1},\ldots,w_{k+l}) \]

with the same $\mathcal{C}$. Hence, $A$ cannot satisfy $\Gamma$ in this case.
2. If on the other hand every accepting computation of $A$ visits ‘]’ on all output
tapes, then the only way $A$ can violate $\Gamma$ is by looping while generating
output onto some output tape but without consuming any of the inputs at
the same time.

However, Sect. 1.2 indicated that reasoning about actual computations is
infeasible. Thus we reason instead about the structure of the transition graph $G_A$.
Therefore, instead of observation 1 the algorithm in Fig. 3 merely tests that
there is no path $\mathcal{P}$ from the start state into a final state, which never requires
‘]’ to appear on some output tape, whereas it would have sufficed to show that
no such $\mathcal{P}$ is ever traversed during any accepting computation. ($\mathbb{B}$ denotes the
Boolean type with values 0 as ‘false’ and 1 as ‘true’.)

Similarly, the algorithm in Fig. 4 enforces a more stringent condition than
observation 2: every cycle $\mathcal{L}$ in $G_A$, during which some output tape is advanced
to direction $+1$, must also move some input tape $i$ into direction $\pm 1$ but not
back into the opposite direction $\mp 1$. Then this tape $i$ acts as a clock, which
eventually terminates the potentially dangerous repetitive traversals of $\mathcal{L}$. Again,
if $A$ violates $\Gamma$, then some $\mathcal{L}'$ failing this condition must exist in $G_A$, but the
converse need not hold, because repetitions of $\mathcal{L}'$ need not necessarily occur
during any accepting computation of $A$. This condition is enforced by repeatedly
deleting transitions, which cannot take part in any $\mathcal{L}'$ of this kind. This technique
is related to analyses of the input-output behavior of logic programs, which
examine the call graph of the given program component by component.

More precisely, the edge deletions made by the algorithm in Fig. 4 can be
justified as follows. Consider the first call made by the main algorithm in Fig. 5.
Every loop mentioned in observation 2 must clearly be contained in some
component $H_i$ of $G = G_A$, the entire transition graph of the $k$-FSA $A$ being tested.

1. A transition between two different strongly connected components certainly
cannot belong to any such loop. The deletions in step 3 are therefore
warranted.

2. Any transition $\tau$ that winds the clock tape $j$ selected for the current
component $H_i$ cannot belong to any such loop either, because $\tau$ cannot be traversed
indefinitely often. Such traversals would eventually wind the input tape $j$ onto
either end-marker, because tape $j$ is not wound into the opposite direction by
any other transition $\tau'$ in $H_i$. The deletions in step 6 are therefore warranted
as well.

This reasoning can then be applied in the subsequent recursive calls on the
reduced components $H_i$ as well, because we can then assume inductively that
the loops broken during the earlier calls could not have been ones mentioned in
observation 2.
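The two tests and their combination can be sketched in Python as follows. This is an illustrative reading of Figs. 3 to 5, not the author's code. It assumes transitions encoded as tuples `(p, chars, dirs, q)`, tapes numbered from 0, and `inputs` as the set X of input tapes. The two example automata `COPY` and `PAD` are hypothetical and do not come from the paper, and strongly connected components are found by plain reachability rather than by a linear-time algorithm.

```python
def reachable(transitions, source):
    """Return the set of states reachable from `source`."""
    seen, todo = {source}, [source]
    while todo:
        state = todo.pop()
        for p, _chars, _dirs, q in transitions:
            if p == state and q not in seen:
                seen.add(q)
                todo.append(q)
    return seen


def halting(transitions, start, finals, inputs, k):
    """Fig. 3: no accepting path may avoid ']' on some output tape."""
    for i in set(range(k)) - set(inputs):
        h = [t for t in transitions if t[1][i] != "]"]
        if reachable(h, start) & set(finals):
            return False
    return True


def components(transitions):
    """Group transitions by strongly connected component (steps 2 and 3);
    transitions between different components are dropped."""
    states = {t[0] for t in transitions} | {t[3] for t in transitions}
    reach = {s: reachable(transitions, s) for s in states}
    groups = {}
    for t in transitions:
        p, q = t[0], t[3]
        if p in reach[q]:  # q leads back to p, so both lie in one component
            key = frozenset(s for s in reach[p] if p in reach[s])
            groups.setdefault(key, []).append(t)
    return list(groups.values())


def looping(transitions, inputs, k):
    """Fig. 4: each cycle that advances an output tape needs an input clock."""
    outputs = set(range(k)) - set(inputs)
    ok = True
    for h in components(transitions):
        # A clock tape winds in exactly one direction within the component.
        clock = next((j for j in sorted(inputs)
                      if len({t[2][j] for t in h} - {0}) == 1), None)
        if clock is not None:
            rest = [t for t in h if t[2][clock] == 0]  # step 6
            d = looping(rest, inputs, k)               # step 7
        else:
            d = not any(t[2][i] == +1 for t in h for i in outputs)  # step 9
        ok = ok and d
    return ok


def limited(transitions, start, finals, inputs, k):
    """Fig. 5: conjunction of the two tests."""
    return (halting(transitions, start, finals, inputs, k)
            and looping(transitions, inputs, k))


# Hypothetical 2-FSA accepting pairs (w, w) over {a, b}; both tapes one-way.
COPY = [
    (0, ("[", "["), (+1, +1), 1),
    (1, ("a", "a"), (+1, +1), 1),
    (1, ("b", "b"), (+1, +1), 1),
    (1, ("]", "]"), (0, 0), 2),
]
# Hypothetical 2-FSA that writes any number of a's without reading tape 0.
PAD = [
    (0, ("[", "["), (0, +1), 1),
    (1, ("[", "a"), (0, +1), 1),
    (1, ("[", "]"), (0, 0), 2),
]
```

With tape 0 as the only input, `COPY` passes both tests because its loops wind the input tape in one direction only, whereas `PAD` passes `halting` but fails `looping`: its loop advances the output tape with no input tape acting as a clock.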

Formalizing this reasoning shows that the algorithm in Fig. 5 is indeed correct, as follows.

Theorem 1. Let $A$ be a $(p + q + r)$-FSA with tapes $1,\ldots,p$ unidirectional, and
let the algorithm in Fig. 5 return 1 on $A$ and $\{1,\ldots,p+q\}$. Then $A$ satisfies
$\{1,\ldots,p+q\} \rightsquigarrow \{p+q+1,\ldots,p+q+r\}$ with the function
function halting( G: transition graph G_A of a k-FSA A;
                  X: subset of {1, ..., k} ): B;
1: b ← 1;
2: for all i ∈ {1, ..., k} \ X do
3:     H ← G without transitions that specify ‘]’ for tape i;
4:     b ← b ∧ (H contains no path from s_A into any state in F_A)
5: end for
6: return b
end halting;



Fig. 3. An Algorithm for Testing Observation 1



function looping( G: subgraph of G_A for a k-FSA A;
                  X: subset of {1, ..., k} ): B;
 1: b ← 1;
 2: H_1, ..., H_m ← the maximal strongly connected components of G;
 3: Delete from G all transitions between different components;
 4: for all i ← 1, ..., m do
 5:     if some tape j ∈ X winds to direction ±1 in H_i but not in direction ∓1 then
 6:         Delete from H_i all transitions that wind this tape j;
 7:         d ← looping(H_i, X)
 8:     else
 9:         d ← no tape in {1, ..., k} \ X winds into direction +1 in H_i
10:     end if
11:     b ← b ∧ d
12: end for
13: return b
end looping;



Fig. 4. An Algorithm for Testing Observation 2



function limited( A: k-FSA;
                  X: subset of {1, ..., k} ): B;
1: return halting(G_A, X) ∧ looping(G_A, X)
end limited;



Fig. 5. An Algorithm for Determining Limitation.
\[ W(m_1,\ldots,m_p,n_1,\ldots,n_q) = g_A(m_1,\ldots,m_p,n_1,\ldots,n_q) - 1, \]

where

\[ g_A(m_1,\ldots,m_p,n_1,\ldots,n_q) = |Q_A| \left( \max(p,1) + \sum_{i=1}^{p} m_i \right) \prod_{j=1}^{q} (n_j + 2). \]




Proof. Let us assume that $\mathcal{C}$ is an arbitrary computation of $A$ on some input

\[ z = (u_1,\ldots,u_p,v_1,\ldots,v_q,w_1,\ldots,w_r). \]

We begin by proving the following two claims about this $\mathcal{C}$, which correspond
to observations 1 and 2.

1. If $\mathcal{C}$ is accepting, then for every tape $p+q+1 \le j \le p+q+r$, $\mathcal{C}$ takes some
transition, which requires ‘]’ on tape $j$.

Let otherwise $h$ be a tape, which violates this claim. $\mathcal{C}$ traces a path through
$G_A$ from $s_A$ into some state in $F_A$. Then step 4 of the algorithm in Fig. 3 sets
$b = 0$ when testing $i = h$, which violates our assumption that the algorithm in
Fig. 5 returns 1, thus proving this claim.

2. No head $p + q + h$ moves to direction $+1$ more than

\[ l = W(|u_1|,\ldots,|u_p|,|v_1|,\ldots,|v_q|) + 1 \]

times during $\mathcal{C}$.

Assume to the contrary that some $h$ violates this claim. Post a fence between
two adjacent configurations $C_g$ and $C_{g+1}$ in $\mathcal{C}$ whenever tape $p + q + h$ moves
to direction $+1$. By our contrary assumption, at least $l + 1$ of these fences are
posted. Consider on the other hand two configurations $C_x$ and $C_y$ of $\mathcal{C}$ to have
the same color if and only if they share the same state and the same head
positions for tapes $1,\ldots,p+q$. At most $l$ of these colors are available, recalling
the assumption that tapes $1,\ldots,p$ are unidirectional. Therefore $\mathcal{C}$ must contain
two configurations $C_x$ and $C_y$, which have the same color, but are separated by
an intervening fence. Consider then the sequence of transitions, which transform
$C_x$ into $C_y$, as a path $\mathcal{L}$ within $G_A$. This $\mathcal{L}$ forms a closed cycle, and tape heads
$1,\ldots,p+q$ are on the same squares both before and after traversing $\mathcal{L}$, because
$C_x$ and $C_y$ share a common color. Let us then see which of the steps 3 or 6
of the algorithm in Fig. 4 will first delete some transition that belongs to $\mathcal{L}$. It
cannot be step 3, because all of $\mathcal{L}$ belongs initially to the same maximal strongly
connected component. But it cannot be step 6 either, because if $\mathcal{L}$ ever moves a
tape $j \in X$ into some direction $\pm 1$, it must also move tape $j$ into the opposite
direction $\mp 1$, in order to return its head onto the same square both
before and after $\mathcal{L}$. Hence $\mathcal{L}$ persists untouched to the very end of the recursion
on step 9, and there the presence of the transition of $\mathcal{L}$ that crosses the fence
between $C_x$ and $C_y$ yields $d = 0$, which subsequently violates our assumption
that the algorithm in Fig. 5 returns 1, thus proving this claim.

Claims 1 and 2 are combined into a proof of the theorem as follows. Assume
that $z \in L(A)$; that is, $A$ has some accepting computation $\mathcal{C}$ on input $z$. It
suffices to show that $|w_h| \le l - 1$ for every $1 \le h \le r$. Tape head $p + q + h$ must
cross every border between two adjacent tape squares from left to right, because
otherwise $\mathcal{C}$ would not meet claim 1. Claim 2 states in turn that $\mathcal{C}$ performs at
most $l$ crossings of this kind. This means that tape $p+q+h$ contains at most $l + 1$
squares, of which the first and the last are reserved for the end-markers, leaving
at most $l - 1$ squares for the characters of the input string $w_h$. □

Example 2. Consider the 2-FSA $A_{rev}$ in Fig. 1. The algorithm in Fig. 5 can
detect that it satisfies $\{1\} \rightsquigarrow \{2\}$ as follows. The algorithm in Fig. 3 returns 1,
because every path into the final state IV must contain the transition from III to IV.

Evaluating the algorithm in Fig. 4 proceeds in turn as follows. First, all
transitions from one state into another are deleted in step 3, leaving only the
loops around states II and III. This is depicted in Fig. 6, where the components
themselves are dotted, and the transitions between them (and thus deleted in
step 3) are dashed. These loops are in turn deleted in step 6 when processing
the corresponding components, and therefore this function eventually returns
1 as well. However, Theorem 1 provides a rather imprecise limitation function
$W'(n) = 4n + 7$ compared to the one given in Example 1.
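The bound quoted in Example 2 can be reproduced numerically. The sketch below assumes that the limitation function of Theorem 1 reads W = g_A - 1 with g_A = |Q_A| (max(p, 1) + m_1 + ... + m_p) (n_1 + 2) ... (n_q + 2), and that A_rev has the four states I to IV; the state count is an assumption, since Fig. 1 is not reproduced in the text. Function and parameter names are illustrative.

```python
from math import prod


def g(num_states, m, n):
    """g_A: m lists the input lengths on the p unidirectional input tapes,
    n lists the input lengths on the q remaining (bidirectional) input tapes."""
    p = len(m)
    return num_states * (max(p, 1) + sum(m)) * prod(nj + 2 for nj in n)


def limitation(num_states, m, n):
    """W = g_A - 1, the output length bound of Theorem 1."""
    return g(num_states, m, n) - 1


# A_rev with {1} -> {2}: no unidirectional input tape (p = 0) and one
# bidirectional input tape (q = 1) holding a string of length 2.
bound_for_ab = limitation(4, [], [2])
```

For an input of length n this evaluates to 4(n + 2) - 1 = 4n + 7, which matches the function W'(n) given in the example.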




Fig. 6. The division of the 2-FSA in Fig. 1 into components.



On the other hand, the algorithm in Fig. 5 fails to detect $\{2\} \rightsquigarrow \{1\}$, which
Example 1 did detect: looping in state II advances tape 1 without moving tape 2,
and seems therefore dangerous to the algorithm in Fig. 4. Intuitively, $A_{rev}$ first
guesses nondeterministically some string, and only later verifies its guess against
the input. Example 1 examined crossing sequences to see that this later checking
in state III indeed reduces the acceptable outputs into only finitely many (here
just one).

Essentially the same limitation function as in Theorem 1 suffices whenever
all of the output tapes $p+q+1,\ldots,p+q+r$ to be limited are unidirectional,
even if $\{1,\ldots,p+q\} \rightsquigarrow \{p+q+1,\ldots,p+q+r\}$ cannot be verified by the
algorithm in Fig. 5 [Theorem 2.1]. This is natural, because the algorithm in
Fig. 4 ignored the effects of moving any output tape $p+q+1,\ldots,p+q+r$ into
direction $-1$.
Theorem 2. Let the only bidirectional tapes of the $(p + q + r)$-FSA $A$ be
$p+1,\ldots,p+q$, and let $A$ satisfy $\Gamma = \{1,\ldots,p+q\} \rightsquigarrow \{p+q+1,\ldots,p+q+r\}$.
Then

\[ W'(m_1,\ldots,m_p,n_1,\ldots,n_q) = (|\Sigma| + 2)^r \, g_A(m_1,\ldots,m_p,n_1,\ldots,n_q) - 1 \]

is a corresponding limitation function, where $g_A$ is as in Theorem 1.

Proof. Consider the proof of Theorem 1 and assume further in claim 2 that
all the output tapes $p+q+1,\ldots,p+q+r$ are made locally consistent as in
Definition 1; this new assumption introduces the factor $(|\Sigma| + 2)^r$ into $W'$. With
this modification, the original fencing-coloring construction shows that if some
accepting computation $\mathcal{C}$ on input $z$ advances some output tape $p + q + h$ more
than $W'(|u_1|,\ldots,|u_p|,|v_1|,\ldots,|v_q|) + 1$ times, then the path of transitions
taken by this $\mathcal{C}$ can be partitioned into three sub-paths $\mathcal{K}\mathcal{L}\mathcal{M}$, where $\mathcal{L}$ begins
in $C_x$ and ends in $C_y$, which share the same color, and contains a transition $\tau$
that crosses some fence between $C_x$ and $C_y$. However, now $A$ must also accept
all the pumped inputs

\[ (u_1,\ldots,u_p, v_1,\ldots,v_q, w_{(1,\mathcal{K})} w_{(1,\mathcal{L})}^t w_{(1,\mathcal{M})}, \ldots, w_{(r,\mathcal{K})} w_{(r,\mathcal{L})}^t w_{(r,\mathcal{M})}), \]

where $t \in \mathbb{N}$, and each $w_{(k,\mathcal{J})}$ denotes the string of characters in those squares of
output tape $p + q + k$, onto which the head lands during $\mathcal{J}$ (‘]’ excluded). This is
in effect an application of Eq. (2) to the output tapes $p+q+1,\ldots,p+q+r$. The
presence of $\tau$ within $\mathcal{L}$ shows that $w_{(h,\mathcal{L})} \neq \varepsilon$, and hence $\Gamma$ fails by observation 2,
thereby proving this modified claim 2.

Claim 1 continues to hold, as reasoned in observation 1, and the theorem
again follows as before. □

Turning now to assessing when the algorithm in Fig. 5 does detect finiteness
dependencies, we see that it is successful at least when all tapes are
unidirectional.

Theorem 3. Let $A$ be a non-redundant $(k + l)$-FSA with all tapes unidirectional
and locally consistent, and let the algorithm in Fig. 5 return 0 on $A$ and
$\{1,\ldots,k\}$. Then $A$ does not satisfy $\Gamma = \{1,\ldots,k\} \rightsquigarrow \{k+1,\ldots,k+l\}$.

Proof. The non-redundancy of $A$ and the unidirectionality and local consistency
of all its tapes imply by Eq. (2) that for every path $\mathcal{P}$ in $G_A$ we can always find
an accepting computation $\mathcal{C}$ on some input $w$ traversing $\mathcal{P}$. Letting $\mathcal{P}$ then be
any subgraph of $G_A$, which caused the algorithm in Fig. 5 to return 0, yields
some $\mathcal{C}$, whose existence violates $\Gamma$ along the lines of Theorem 2. □

3 Evaluation of the Limited Answers 

Having inferred the $(k + l)$-FSA $A$ to satisfy $\Gamma = \{1,\ldots,k\} \rightsquigarrow \{k+1,\ldots,k+l\}$,
we then want to generate the (finite) set of outputs $v = (v_1,\ldots,v_l)$ for a given
input $u = (u_1,\ldots,u_k)$. This problem is known to be difficult in general: let $B$ be
a 2-FSA with a unidirectional input tape 1 and a bidirectional output tape 2,
and ask if a given input $u$ can produce any output $v$. This problem is equivalent
to whether $B$, considered as a checking stack automaton, accepts $u$
[Theorem 5.1], which is known to be either PSPACE- or NP-complete, depending
on whether $B$ is a part of the instance or not [Problem AL5]. However, the
additional information $\Gamma$ provides certain optimization possibilities.

A straightforward way to obtain an evaluation algorithm is to convert the
output tapes from read-only into write-once, and perform these writing
operations concurrently with the simulation of the nondeterministic control. Fig. 8
shows the resulting algorithm, where the simulations of all the possible
computations are performed in a depth-first order using a stack $S$. The algorithm
maintains for each $1 \le j \le l$ an extensible character array $W_j[0 \ldots L_j]$, which
holds the contents for the tape squares the output head $k + j$ has already
examined during the computation $\mathcal{C}$ of $A$ currently being simulated. Fig. 7 shows the
indices during one simulation of the 2-FSA $A_{rev}$ from Fig. 1, where the input
tape 1 contains the string ab, whose reversal is being generated onto the output
tape 2.

In Fig. 8, $T_A$ is enumerated as $\tau_1,\ldots,\tau_{|T_A|}$, and $\tau_0$ is a new starting
pseudo-transition into $s_A$ with head movements $0,\ldots,0$. We also assume as in
Sect. 1.1 that no final state in $F_A$ has any outgoing transitions.

Note that $\Gamma$ alone does not guarantee that the algorithm in Fig. 8 halts;
it just guarantees that only finitely many different outputs are ever generated.
Consider namely the situation in Fig. 9, where the 2-FSA $A_{rev}$ in Fig. 1 is being
used as a transducer in the opposite direction to Fig. 7: input is read from
tape 2 and written onto tape 1. As explained in Example 2, $A_{rev}$ is now guessing
nondeterministically a possible output for later verification against the input.
But how long should the guesses that $A_{rev}$ is allowed to make be?

This question can be answered by noting that if the currently simulated
computation $\mathcal{C}$ on input $u$ has written more than

\[ B = W(|u_1|,\ldots,|u_k|) \tag{3} \]

characters onto some output tape, then $\mathcal{C}$ must eventually reject by Definition 2.
Hence we can safely add the extra condition $\max(L_1,\ldots,L_l) \le B + 1$ into
branch 15 of the algorithm in Fig. 8 to prune all computations $\mathcal{C}$ that attempt
to generate too long outputs.

Thereafter the stack $S$ will always contain only finitely many different
configurations $C$ of the transducer being simulated on input $u$. (These transducer
configurations $C$ can be defined in a straightforward way by extending the acceptor
configurations defined in Sect. 1.1 with write-once output tapes.) Although $S$
represents these configurations $C$ only implicitly, they can be reconstructed as in
branch 10 of the algorithm in Fig. 8. However, some of these configurations $C$
can still repeat, because the transducer being simulated can also loop on the
already known parts of its tapes without generating new output. Fortunately
this looping can be detected and eliminated simply by testing in branch 15 of
the algorithm in Fig. 8 that the new configuration $C_{new}$ being pushed into $S$
Fig. 7. Simulating the 2-FSA in Fig. 1 as a Transducer.



procedure simulate( A: (k + l)-FSA;
                    u_1, ..., u_k: Σ* );
 1: Set m_i ← 0 for all 1 ≤ i ≤ k;
 2: Set n_j ← 0, L_j ← 0 and W_j[L_j] ← [ for all 1 ≤ j ≤ l;
 3: Set t ← 1;
 4: Initialize stack S to contain (0; L_1, ..., L_l) only;
 5: while S is nonempty do
 6:     Let (t'; L'_1, ..., L'_l) be the top element in S, and let τ_t' = p → q
        with head movements d'_1, ..., d'_k, e'_1, ..., e'_l;
 7:     if q ∈ F_A then
 8:         output(v_1, ..., v_l), where each v_j = W_j[1] ... W_j[L_j − 1];
 9:         Set t ← |T_A| + 1
10:     else if t > |T_A| then
11:         Set m_i ← m_i − d'_i for all 1 ≤ i ≤ k;
12:         Set n_j ← n_j − e'_j and L_j ← L'_j for all 1 ≤ j ≤ l;
13:         Pop off the top element from S;
14:         Set t ← t' + 1;
15:     else if τ_t = q → r, with characters c_1, ..., c_l on the output tapes and
        head movements d_1, ..., d_k, e_1, ..., e_l, satisfies
        (n_j ≤ L_j) ⇒ (W_j[n_j] = c_j) for 1 ≤ j ≤ l then
16:         Push (t; L_1, ..., L_l) into S;
17:         Set m_i ← m_i + d_i for all 1 ≤ i ≤ k;
18:         Set L_j ← L_j + 1 and W_j[L_j] ← c_j for all those 1 ≤ j ≤ l, on which
            n_j > L_j holds;
19:         Set n_j ← n_j + e_j for all 1 ≤ j ≤ l;
20:         Set t ← 1
21:     else
22:         Set t ← t + 1
23:     end if
24: end while
end simulate;



Fig. 8. An Algorithm for Using Acceptors as Transducers
Fig. 9. Generating an Indefinitely Long Output with the 2-FSA in Fig. 1

does not yet occur in $S$. This is a standard way to avoid repetition during a
depth-first search [Chapter 3.6]. We have also experimented with comparing
$C_{new}$ against all the configurations $C$ encountered so far in the entire search
conducted by the algorithm in Fig. 8 on the current input $u$, but this proved to
be extremely inefficient in practice.
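The two safeguards discussed above, pruning by the bound B of Eq. (3) and rejecting configurations that already occur on the current depth-first path, can be sketched as a recursive search. This is a simplified illustration, not the stack-based algorithm of Fig. 8. It assumes a 2-FSA whose tape 0 is the given input and whose tape 1 is the write-once output, with transitions encoded as tuples `(p, (c0, c1), (d0, d1), q)`. The reversal automaton `REV` is a hypothetical stand-in, since Fig. 1 is not reproduced in the text.

```python
def transduce(transitions, start, finals, u, bound):
    """Collect the outputs on tape 1 that are accepted together with input u."""
    tape0 = "[" + u + "]"
    results = set()
    on_path = set()  # configurations on the current depth-first path

    def dfs(state, m, n, out):
        if state in finals:
            if out.endswith("]"):          # the output is complete
                results.add(out[1:-1])
            return
        config = (state, m, n, out)
        if config in on_path:              # repeated configuration: a loop
            return
        on_path.add(config)
        for p, (c0, c1), (d0, d1), q in transitions:
            if p != state or tape0[m] != c0:
                continue
            if n < len(out):               # square already written: re-read it
                if out[n] != c1:
                    continue
                new_out = out
            else:                          # fresh square: write c1 once
                if c1 == "[" or out.endswith("]") or len(out) > bound + 1:
                    continue               # prune outputs longer than the bound
                new_out = out + c1
            m2, n2 = m + d0, n + d1
            last = len(new_out) - 1 if new_out.endswith("]") else len(new_out)
            if 0 <= m2 < len(tape0) and 0 <= n2 <= last:
                dfs(q, m2, n2, new_out)
        on_path.remove(config)

    dfs(start, 0, 0, "[")
    return results


# Hypothetical reversal automaton: run to the right end of tape 0, then
# walk back to the left while writing each character onto tape 1.
REV = [
    (0, ("[", "["), (+1, 0), 1),
    (1, ("a", "["), (+1, 0), 1),
    (1, ("b", "["), (+1, 0), 1),
    (1, ("]", "["), (-1, +1), 2),
    (2, ("a", "a"), (-1, +1), 2),
    (2, ("b", "b"), (-1, +1), 2),
    (2, ("[", "]"), (0, 0), 3),
]
# The same automaton with an idle self-loop that moves no head at all.
REV_IDLE = REV + [(1, ("a", "["), (0, 0), 1)]
```

With the limitation function W(n) = n, the call `transduce(REV, 0, {3}, "ab", 2)` explores each configuration at most once per path, and `REV_IDLE` still terminates because its idle loop is cut off by the repetition test.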

Now we have solved the evaluation problem by developing a halting variant of the
evaluation algorithm in Fig. 8. However, this solution suffers from two drawbacks:

1. The cut-off value $B$ in Eq. (3) is needed to estimate the pruning depth,
and hopefully tightly.

2. The whole stack $S$ must be scanned against repeating configurations $C$ when
pushing each new configuration $C_{new}$.

Fortunately these drawbacks can be alleviated by considering how $\Gamma$ was inferred
to hold.

Let $\Gamma$ be inferred by having the algorithm in Fig. 5 return 1 on $A$ and $\Gamma$.
Claim 2 in the proof of Theorem 1 shows that every computation $\mathcal{C}$ of $A$ is
“self-limiting” in the sense that no $L_j$ can grow indefinitely. Thus the bound $B$
in Eq. (3) is not needed after all, thereby alleviating drawback 1.

Claim 2 alleviates also drawback 2. Two occurrences $C_x$ and $C_y$ of the same
configuration during $\mathcal{C}$ have the same color by definition. The proof of the claim
shows that $C_x$ and $C_y$ can only arise by traversing a closed loop $\mathcal{L}$, which is not
deleted during the algorithm in Fig. 4. We therefore modify this algorithm to
mark in $A$ the transitions it considers deleted. Then the algorithm in Fig. 8 can
stop scanning its stack $S$ as soon as the most recent marked transition is seen.
(In addition, these $\mathcal{L}$ do not advance any output tape $k + j$ into direction $+1$, and
therefore this marking technique could be extended even into those transitions
in $\mathcal{L}$, which move an output tape into the opposite direction $-1$ instead.)

Example 3. Applying this modified algorithm in Fig. 4 to the 2-FSA $A_{rev}$ in
Fig. 1 and $\Gamma = \{1\} \rightsquigarrow \{2\}$ in this way marks every transition by Example 2.
Hence the algorithm in Fig. 8 suffices unmodified in this case: a compile-time
check showed that possibly expensive run-time loop checking is in fact
unnecessary, leading to an efficient simulation method for $A_{rev}$.
Note that this marking technique can also speed up the simulation of those
$m$-FSAs $B$, which are still used as acceptors and not transducers: compute the
marking given by the modified algorithm in Fig. 4 with $B$ and $\{1,\ldots,m\}$ (which
yields 1), and use the resulting stack scanning optimization strategy during the
simulation of $B$ on any given input $(u_1,\ldots,u_m)$ to identify transitions, which
cannot take part in any loops, where a configuration repeats.

A related strategy works also in the case of Theorem 2. Its proof shows that
a halting but still correct variant of the algorithm in Fig. 8 can be obtained
by adding into its branch 15 the extra condition that configurations $C_x$ and $C_y$
of the same color may not repeat within any computation $\mathcal{C}$: if the path $\mathcal{L}$ of
transitions from $C_x$ into $C_y$ advanced any of the unidirectional output tapes
$p+q+1,\ldots,p+q+r$, then this $\mathcal{C}$ must be rejecting, while otherwise taking
$\mathcal{L}$ during $\mathcal{C}$ was unnecessary. Then a variant of the algorithm in Fig. 4, which
attempts to mark every transition it can (instead of trying to test for $\Gamma$ and
fail), identifies those cycles $\mathcal{L}$ that can cause some color to repeat. Therefore the
same stack scanning optimization strategy as above still applies.



4 Conclusions 

We studied the problem of using a given nondeterministic two-way multi-tape 
acceptor as a transducer by supplying inputs onto only some of its tapes, and 
asking it to generate the rest. We developed an algorithm for ensuring that this 
transduction does always yield finite answers, and another algorithm for actually 
computing these answers when they are guaranteed to exist. In addition, these 
two algorithms provided a way to optimize the simulation of nondeterministic 
two-way multi-tape acceptors by restricting the amount of work that must be 
performed during run-time loop checking. 

These algorithms have been implemented in the prototype string database
management system being developed at the Department of Computer Science
at the University of Helsinki.
Abstract. Applications described by Sequential Function Chart (SFC)
are often critical, so we have studied the possibilities of program
checking. In particular, SFC programs can handle physical time using
temporisations, which is why we are interested in quantitative temporal
properties. We have proposed a modeling of SFC in timed automata, a
formalism which takes time into account. In this modeling, we use the
physical constraints of the environment. Verification of properties can
be carried out using the model-checker Kronos. We apply this method
to SFC programs of average size, such as the controlling part of
the production cell Korso. Since the size of the programs nevertheless
remains a limit, we are studying ways of solving this problem.



1 Introduction 

The control language in which we are interested is Sequential Function Chart
(SFC is the English name of Grafcet). Developed since 1977, this graphical
language is based on the step-transition model. Through temporisations, it makes
it possible to take time into account. This intuitive and practical language has
proved well suited to the programming of automated systems. It is one
of the languages defined by IEC 1131-3 for programming Programmable Logic
Controllers. For these, safety is needed; it is necessary to make sure that
specifications are respected by the program. To carry out these verifications, SFC
has been modeled in various formalisms which have verification tools.

These modelings have a limit: time is not taken into account. However, time
plays an important role in the control of many automated systems (for
instance through timeouts). Thus it is important to treat it. This is why we are interested
in temporized SFC and in its temporal verification.

After having presented the main principles of SFC, we will justify the choice
of timed automata for the modeling of SFC. Then this modeling will be
described as well as the verifications which it makes possible. Finally we will
indicate how we take into account the constraints of the physical world and how
the size of the automata can be reduced.



J.-M. Champarnaud, D. Maurel, D. Ziadi (Eds.): WIA’98, LNCS 1660, pp. 149-^23 1999- 
(c) Springer- Verlag Berlin Heidelberg 1999 
2 SFC

SFC [12] is a chart model of the behaviour of the control parts of an automated
system.

2.1 Structure 

The basic graphical elements are (see Figure 1):

— the steps which represent the various states of a system. They are symbolized 
by squares. The initial steps are represented by double squares. During the 
evolution of an SFC program, the steps are either active or inactive ; during 
initialization, only the initial steps are active. The set of the active steps 
of an SFC program at a given moment defines the situation of this SFC 
program; 

— the transitions which are used to control the change from a state to another 
one. They are represented by a horizontal line and control the evolution from 
step to step. They have two values ; they can be validated or not validated. 
A transition is validated when all the steps preceding it are active. With any 
transition, a receptivity is associated, i.e. a Boolean function of the inputs 
and internal variables of the SFC program like step variables which test if 
a step is active or not. If a transition is validated and its receptivity has a 
true value, then this transition is fireable. 

In the receptivities, a particular function makes it possible to measure time:
the temporisation. The temporisation t1/Xi/t2 denotes a Boolean condition
which becomes true once step i has remained active for at least t1 units of
time, and which becomes false t2 units of time after the deactivation of
step i. No structural relation is imposed between the use of a temporisation and
the step i it refers to (in the temporisation t1/Xi, the value of t2 is implicitly
0).
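One possible reading of the temporisation t1/Xi/t2 is sketched below for a single activation period of step i. The function name and its parameters (`activated_at`, `deactivated_at`, `now`) are assumptions made for illustration, and the treatment of a step that is deactivated before t1 has elapsed (the condition never becomes true) is an interpretation rather than something the text states.

```python
def temporisation(t1, t2, activated_at, deactivated_at, now):
    """Value at time `now` of t1/Xi/t2 for one activation period of step i.

    `activated_at` is the activation time of step i (None if never active),
    `deactivated_at` its deactivation time (None if still active).
    """
    if activated_at is None or now < activated_at:
        return False
    if deactivated_at is None or now < deactivated_at:
        # Step i is currently active: true once it has been active for t1.
        return now - activated_at >= t1
    # Step i has been deactivated: the condition holds for t2 more time
    # units, provided the step had been active long enough to set it.
    held_long_enough = deactivated_at - activated_at >= t1
    return held_long_enough and now < deactivated_at + t2
```

For the short form t1/Xi, passing `t2 = 0` makes the condition fall back to false at the moment of deactivation.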



2.2 Behaviour 

Two postulates define the conceptual framework in which SFC must evolve: 

— Postulate 1: All the events are taken into account as soon as they occur 
and for all their incidences. 

— Postulate 2: In the SFC model, causality is considered at null time. 

From these two statements, it should be noted that SFC model is sensitive to 
any external event, whatever its time of occurrence. Any change in the external 
environment must be taken into account, and the reaction induced must be 
calculated at null time. 

The five following rules define the evolution of an SFC program. 

— Rule 1: In the beginning, only the initial steps are active. 

— Rule 2: A transition is validated if all the preceding steps are active. A
transition is fireable when it is validated and its receptivity is true.
— Rule 3: A fireable transition is immediately fired. The immediately following
steps are then activated and the immediately preceding steps are deactivated.
Activations and deactivations are performed simultaneously.

— Rule 4: If, in an SFC program, several transitions are simultaneously fire- 
able, they are fired simultaneously. 

— Rule 5: If a step is simultaneously activated and deactivated, it remains 
active. The priority is given to activation. 



2.3 Interpretation 

The behaviour of an SFC program is described by the five rules of evolution. 
Those are supplemented by interpretation algorithms. The main interpretations 
are named No Search for Stability (NSS) and Search for Stability (SS). 

— NSS Interpretation: In the case of the NSS interpretation, a step of evolu- 
tion corresponds to a simple evolution, that is, the simultaneous firing of all 
the fireable transitions. Carrying out a step of simple evolution corresponds 
to the acquisition of inputs, to the computation of the new situation and its 
output towards the external world. 

— SS Interpretation: In the case of the SS interpretation, a step of evolution 
corresponds to an iterated evolution, that is, a simple evolution with acqui- 
sition of the inputs followed by a continuation, possibly empty, of simple 
evolutions without acquisition of inputs, until obtaining a stable situation. 
A situation is stable when no transition is fireable without taking new in- 
puts. A cycle of instability is a sequence of simple evolutions not leading to 
a stable situation. 
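
The difference between the two interpretations can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: transitions are hypothetical triples (preceding steps, following steps, receptivity), inputs are frozen during an SS step, receptivities that depend on input edges are not modeled, and a cycle of instability is detected simply by revisiting a situation.

```python
def fireable_transitions(active, transitions, inputs):
    # A transition is fireable when it is validated and its receptivity is true.
    return [(b, a) for (b, a, r) in transitions if b <= active and r(inputs)]


def simple_evolution(active, transitions, inputs):
    # Simultaneous firing of all the fireable transitions.
    fired = fireable_transitions(active, transitions, inputs)
    deactivated = frozenset().union(*(b for b, _ in fired))
    activated = frozenset().union(*(a for _, a in fired))
    return (active - deactivated) | activated


def nss_step(active, transitions, inputs):
    # NSS: one simple evolution per acquisition of the inputs.
    return simple_evolution(active, transitions, inputs)


def ss_step(active, transitions, inputs):
    # SS: iterate simple evolutions with the same inputs until stability.
    seen = set()
    current = active
    while fireable_transitions(current, transitions, inputs):
        if current in seen:
            raise RuntimeError("cycle of instability")
        seen.add(current)
        current = simple_evolution(current, transitions, inputs)
    return current


# Hypothetical chain 0 -(a)-> 1 -(true)-> 2 -(b)-> 3
CHAIN = [
    (frozenset({0}), frozenset({1}), lambda i: i["a"]),
    (frozenset({1}), frozenset({2}), lambda i: True),
    (frozenset({2}), frozenset({3}), lambda i: i["b"]),
]
```

With `a` true and `b` false, `nss_step` stops in situation {1}, whereas `ss_step` continues through the unstable situation {1} and stops in {2}.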

Despite these rules and interpretations, ambiguities persist in the description
of SFC programs. For the modelings that follow, the choices that were made are
detailed in HH.

The SFC program of Figure 1 illustrates the various points of our discussion.
At the beginning, steps 0 and 10 are active and input A is assumed to be
false. Transitions 0 and 2 are thus validated but not fireable. Three evolutions
are then possible:

— input A becomes true before 10 units of time. In this case, transition 0 is 
fireable. Its firing causes the deactivation of step 0 and the activation of step 
1. The situation { 1,10 } is reached, applying rule 3. 

— 10 units of time run out without A becoming true. Transition 2 is fired, the 
situation { 0,11,12 } is reached, applying rule 3. 

— input A becomes true exactly 10 units of time after the activation of step 
10. Transitions 0 and 2 are simultaneously fired. The situation reached is 
{ 1,11,12 }, applying rules 3 and 4. 

The SFC program then continues to evolve from the current situation. 

Fig. 1. An SFC program



3 Checking by Using Timed Automata 

“Synchronous” languages have been proposed to address the problems of safe
programming. The basic assumption of these languages is that the outputs are
considered simultaneous with the inputs that generate them. The SFC
language also makes this assumption. In the case of the languages Signal jSj and
Lustre jOj, the data-flow approach further increases the proximity between SFC
and these languages. On the other hand, whereas the definition of SFC is purely
textual and does not provide a clear semantics, the languages Signal and Lustre
have a mathematical semantics. Therefore the modeling of SFC in such languages 0
gives us a means of clarifying the semantic choices for SFC. It also allows
us to build a simulator and to check properties. However, this approach is not
suitable for checking quantitative temporal properties. Indeed, the discrete
representation of time induces an explosion of the number of states of the graph
representing all possible runs, which quickly makes verification infeasible.

To address this state explosion, an approximate method based on convex
polyhedra has been proposed. Verifications are performed not on the
whole graph but on an abstraction of it.

We have chosen another approach which takes physical time into account. 
Thus we have studied timed Petri nets 0, timed transition models (TTM) [T^. 
timed automata 0 H2] and hybrid systems PJ. Timed automata give us a good 
compromise between the power of expression and the possibility of verification. 



3.1 Timed Automata 

Informally, timed automata are automata extended by a set of real variables, 
called clocks, the values of which grow uniformly with the passage of time and 
which can be set to zero. Constraints relating to these clocks are associated with 
the states and transitions. These timing constraints define the time during which
the system can remain in a state and the possibility of firing a transition. Timed 
automata thus allow a compact modeling of time. Moreover, verifications using 
model-checking are possible on timed automata. This is why we have chosen 
them to model SFC. 



Definition A timed automaton is a tuple $\langle S, H, A, Inv \rangle$ where

— $S$ is a finite set of control locations, where $s_{init}$ is the initial
location.

— $H$ is a finite set of clocks, real variables taking their values in the set of
non-negative real numbers.

— $A$ is a finite set of edges. Each edge is defined by a tuple
$(s_s, \psi, l, R, s_b)$. $s_s$ and $s_b$ are the source and target locations
respectively of the edge. $\psi$ is a timing constraint: to fire the edge, the
clocks must satisfy it. $l$ is a label. $R$ is the set of the clocks to be set to
zero when the edge is fired. The edge $(s_s, \psi, l, R, s_b)$ is also noted
$s_s \xrightarrow{\psi, l, R} s_b$.

— $Inv : S \to \Psi(H)$ associates with each location a time-progress condition
called invariant. While the clocks satisfy the invariant, the system may stay in
the location.

At the beginning, the system is at the initial location with all the clocks 
having the value 0. 



Semantics The semantics of a timed automaton is given by a transition system
$\langle Q, \rightarrow, (s_0, v_0) \rangle$ where $Q$ is the set of states,
$\rightarrow$ the set of transitions and $(s_0, v_0)$ the initial state.

— A state $(s, v)$ is a location $s$ and a valuation $v$ of all the clocks.

— The initial state is the pair $(s_0, v_0)$ where $s_0$ is the initial location
and $v_0$ is the valuation which associates 0 with all the clocks.

— From the state $(s_s, v)$, the transition $(s_s, \psi, l, R, s_b)$ can be fired
if the valuation of the clocks satisfies $\psi$. We note $v[R]$ the valuation of
the clocks after the firing of the transition: the clocks in $R$ are set to zero
and the values of the other clocks remain unchanged.
The following rule expresses this:



rule 1 : $\dfrac{s \xrightarrow{\psi, l, R} s' \;\wedge\; \psi(v)}{(s, v) \rightarrow (s', v[R])}$



While the constraint associated to the state is true, the system may stay in 
the state. 



rule 2 : $\dfrac{\forall t \in [v, v + d] \;\; Inv(s, t)}{(s, v) \rightarrow (s, v + d)}$

At any state, the system can evolve either by a discrete state change cor- 
responding to a move through an edge that may change the location and reset 
some of the clocks, or by a continuous state change due to the progress of time 
at a location. 
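
The two kinds of state change can be sketched in Python as follows. This is an illustrative sketch under assumptions: guards and invariants are plain Python predicates, invariants are assumed convex (so checking both ends of a delay suffices), and the names `Edge`, `discrete_step`, `delay_step`, the locations `wait`/`done` and the clock `x` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Optional, Tuple

Valuation = Dict[str, float]
State = Tuple[str, Valuation]


@dataclass(frozen=True)
class Edge:
    source: str
    guard: Callable[[Valuation], bool]   # timing constraint of the edge
    label: str
    resets: FrozenSet[str]               # clocks set to zero when fired
    target: str


def discrete_step(state: State, edge: Edge) -> Optional[State]:
    # Rule 1: the edge is fired if the valuation satisfies its guard.
    loc, v = state
    if loc != edge.source or not edge.guard(v):
        return None
    return edge.target, {c: 0.0 if c in edge.resets else x for c, x in v.items()}


def delay_step(state: State, d: float, inv) -> Optional[State]:
    # Rule 2: time may progress by d while the invariant holds.
    # Assumption: the invariant is convex, so both end points are enough.
    loc, v = state
    later = {c: x + d for c, x in v.items()}
    if d < 0 or not inv[loc](v) or not inv[loc](later):
        return None
    return loc, later


# Hypothetical two-location automaton with a single clock x.
INV = {"wait": lambda v: v["x"] <= 10, "done": lambda v: True}
EDGE = Edge("wait", lambda v: v["x"] == 10, "timeout", frozenset({"x"}), "done")
```

Starting from `("wait", {"x": 0.0})`, a delay of 11 is refused by the invariant, while a delay of 10 followed by `EDGE` leads to `("done", {"x": 0.0})`.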

3.2 Modeling

First we present the modeling cn in a general way. Then we specify what each 
element of a timed automaton represents. A location represents an SFC situation, 
a value of the inputs and of the temporisations. The transitions correspond 
to a change of the inputs or an evolution of time inducing a change of the 
temporisation values. If these changes bring about the firing of some transitions 
in the SFC program, the target location represents the situation after evolution. 
The invariants of the states and the temporal constraints express the constraints 
resulting from temporisations. 



Location In the general case, a location of a timed automaton is defined by a 
situation of the SFC program, a valuation of the Boolean input variables and the 
values of the temporisations appearing in the SFC program. Several locations of 
the automaton can correspond to a single situation of the SFC program. 

For the SFC program of Figure 1, if we suppose that input A is false at
the beginning, the initial location is {0, 10, A, tempo1, tempo2, tempo3, tempo4}
where tempo1, tempo2, tempo3 and tempo4 indicate the temporisations 10/X10,
15/X1, 20/X12 and 10/X12 respectively.



Clock A clock is associated with each step appearing in a temporisation. The
value of such a clock is the time elapsed since the associated step became
active or inactive.

For the SFC program of the example, 3 clocks are defined: h10 for the step
10 appearing in tempo1, h1 for the step 1 appearing in tempo2, and h12 for the
step 12 appearing in tempo3 and in tempo4.



Invariant Associated with a Location The invariant associated with a
location expresses the constraint which the clocks have to satisfy so that no
temporisation changes its value in the location. First of all, we look for the
relevant clocks in a location, i.e. those associated with a step referred to in a
temporisation whose value may change. They are the clocks which satisfy
one of the two conditions:

— the clock is associated with step i, step i is active and there is a false
temporisation referring to step i. This temporisation may become true.

— the clock is associated with step i, step i is inactive and there is a true
temporisation referring to step i. This temporisation may become false.

The constraint associated with a clock satisfying the first condition is:
$h_i < \min_j t_{1j}$ for { $tempo_j = t_{1j}/X_i/t_{2j}$ with $tempo_j$ false }.
The constraint associated with a clock satisfying the second condition is written:
$h_i < \min_j t_{2j}$ for { $tempo_j = t_{1j}/X_i/t_{2j}$ with $tempo_j$ true }.

Finally, the constraint associated with a location is true if the location does
not comprise any relevant clock; otherwise, it is the conjunction of the
constraints associated with the relevant clocks.

For example, in the initial location, only the clock h10 is relevant because only
the temporisation tempo1 may change. The invariant is written h10 < 10.
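
The selection of relevant clocks and the resulting bounds can be sketched in Python. This is a hedged illustration, not the authors' tool: the `Tempo` record and the function name are hypothetical, and the second delay t2 of each temporisation is assumed to be 0 because the text only gives the form t1/Xi for the example.

```python
from typing import NamedTuple


class Tempo(NamedTuple):
    name: str
    t1: int    # delay before the temporisation becomes true once the step is active
    step: int  # step i referred to as Xi
    t2: int    # delay before it becomes false once the step is inactive (assumed)


def invariant_bounds(active, tempo_values, tempos):
    """Map each relevant clock (indexed by its step) to the bound of its constraint.

    An empty result means that the invariant of the location is "true".
    """
    bounds = {}
    for tp in tempos:
        is_active = tp.step in active
        value = tempo_values[tp.name]
        if is_active and not value:      # first condition: may become true
            bound = tp.t1
        elif not is_active and value:    # second condition: may become false
            bound = tp.t2
        else:
            continue
        bounds[tp.step] = min(bound, bounds.get(tp.step, bound))
    return bounds


# Temporisations of the example, with t2 = 0 as a labeled assumption.
TEMPOS = [
    Tempo("tempo1", 10, 10, 0),
    Tempo("tempo2", 15, 1, 0),
    Tempo("tempo3", 20, 12, 0),
    Tempo("tempo4", 10, 12, 0),
]
ALL_FALSE = {tp.name: False for tp in TEMPOS}
```

For the initial situation {0, 10} with all temporisations false, the only entry returned is the bound 10 for the clock of step 10, which agrees with the invariant given in the text.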



Transition An edge of the timed automaton corresponds to a change of the
inputs and/or an evolution of time that modifies the values of temporisations.
An input may change in any location. Only temporisations corresponding to the
relevant clocks in the location may change.

— The timing constraint associated with a transition indicates whether one or
more temporisations change. When the source location does not include any
relevant clock, the transition is not constrained temporally: its timing
constraint is “true”. On the other hand, if the source location includes one or
more relevant clocks, then the timing constraint is a conjunction of propositions
$h_i = t_i$ and $h_j < t_j$. The first form corresponds to a change of
temporisation while the second indicates that the temporisation remains unchanged.

— The clocks of the steps that were activated or deactivated during the
transition are set to zero, so that the value of a clock is always the time
elapsed since its step became active or, respectively, inactive. For example,
h1 is set to zero on the first transition shown in Figure 2 because step 1 is
activated.

— From a situation, the inputs, the edges of inputs, the temporisations and the
edges of steps, the new situation is obtained by a simultaneous firing of all the
fireable transitions, and the new value of the temporisations is computed. The
target location is then defined by the situation reached and by the updated
inputs and temporisations.

The transitions of the timed automaton do not necessarily correspond to the
firing of SFC program transitions.

From the initial location {0, 10, A, tempo1, tempo2, tempo3, tempo4}, three
transitions are possible according to whether the input A and/or the
temporisation tempo1 become true. These transitions are described in Figure 2.

Construction The construction of the timed automaton starts with the
definition of the initial location. This location is then treated, i.e. its
invariant and the transitions leaving it are computed, which in general creates
new locations. The construction continues as long as some location has not been
treated. Since the number of possible locations is finite, the algorithm
terminates.

The complete automaton representing the SFC program of Figure 1 has
3 clocks, 14 locations and 58 transitions in the case of the SS interpretation.
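
The construction is a classical worklist exploration. The Python sketch below shows the idea only; `build_automaton`, the `successors` callback and the toy SFC (one step 0, one step 1, one input A) are hypothetical, and invariants and timing constraints are left out.

```python
from collections import deque


def build_automaton(initial, successors):
    """Explore locations from `initial` until every location has been treated.

    `successors(loc)` must return the (label, target) pairs leaving `loc`.
    Termination relies on the set of possible locations being finite.
    """
    locations = {initial}
    edges = []
    todo = deque([initial])
    while todo:
        loc = todo.popleft()              # treat one untreated location
        for label, target in successors(loc):
            edges.append((loc, label, target))
            if target not in locations:   # a new location has been built
                locations.add(target)
                todo.append(target)
    return locations, edges


def toy_successors(loc):
    # Hypothetical SFC: step 0 -(A)-> step 1; a location is (step, value of A).
    step, a = loc
    new_a = not a                         # the only event: input A changes
    new_step = 1 if (step == 0 and new_a) else step
    return [("A changes", (new_step, new_a))]
```

On this toy example, `build_automaton((0, False), toy_successors)` produces 3 locations and 3 edges.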

3.3 Checking 

We model the SFC program with a timed automaton in order to check properties 
using the verification tool Kronos 0. 

The properties expressed at the SFC level must be translated into Timed
Computation Tree Logic (TCTL) to be checked by the model-checker Kronos.
This translation is an important stage of the checking. It requires a thorough
knowledge of the logic and often requires the property to be expressed very
precisely.

Fig. 2. First step of the construction of the timed automaton for the SFC
program of Fig. 1

TCTL TCTL |Zj is a temporal logic which extends the branching-time logic CTL
by introducing a global variable: time. As a tree logic, TCTL uses symbols
that concern both the set of all possible executions ($\exists$: there is an
execution, $\forall$: for all the executions) and the set of execution states
($\Diamond$: there is a state, $\Box$: for all the states, $U$: until a state).
In order to introduce time explicitly into the syntax, the scope of the temporal
operators is time-limited. Thus, the formula $\forall\Box_{<4}\, p$ intuitively
means that, for all the executions of the system, proposition p is true in all
the states until the fourth time unit.



Some Properties TCTL, although intended for quantitative temporal
properties, also makes it possible to write the usual qualitative temporal
properties.

Thus to check that a situation is a deadlock, i.e. a situation which can no
longer be left, various formulae can be defined. The following
formula makes it possible to know whether the situation S is reachable and is
always a deadlock:
$(init \Rightarrow \exists\Diamond\, S) \wedge (init \Rightarrow \forall\Box\,(S \Rightarrow \forall\Box\, S))$.

Without being always a deadlock, a situation may be locked in some cases:
$init \Rightarrow \exists\Diamond\,(S \wedge (S \Rightarrow \forall\Box\, S))$.

We show that the situation {0,11,14} is not a deadlock but that, once the
steps 11 and 14 are reached, they remain active forever.

On timed automata, we can also check quantitative temporal properties: 

— We can check the duration of activation of a step: does step i remain active 
more than (at least) t units of time? For instance, we check that step 1 could 
remain active more than 15 units of time by showing that the following 
formula is true: 
init =i>3<) ( eO_l A ( eO_l eO_l 3W>is eO-1)) where eO_l is the proposition
associated with the location when the step 1 is active. 

— We can also study the time which separates the activations of two distinct
situations S1 and S2. Thus to show that the maximum duration between the
activation of S1 and the activation of S2 is lower than t, the following formula
must be false:

init => 3<) ( 5 i A (^1 ^ ^ ^2 8 U>t - ^2))). 

Using the Kronos tool, we succeeded in checking properties on the timed
automata resulting from SFC programs. With this method, we can check larger SFC
programs with more temporisations. Moreover, delays are no longer a limitation,
since the complexity of the verification algorithm is independent of the delay
values.

4 Applications 

We have studied the production cell Korso. The control of this application
is programmed very easily as an SFC program. For the checking,
we have to solve two problems: taking the environment into account and reducing
the size of the automata. We present the solutions we have found and the tools
we have developed.



4.1 Taking Environment into Account 

To explain the problem, we take as an example an element of the operative part
of Korso, the press. The press consists of a horizontal plate which can move
vertically. The SFC program of the press is given in Figure 3. In steps 50 and
51, the press waits in the median position (cap2) until a metal blank is loaded
by the robot (step 33 of the robot is then reached). Then it goes up (step 52) to
the high position (cap3) and the metal blank is worked (step 53) during 2 units
of time. Then the press goes down (step 54) to the low position (cap1) where it
waits (step 55) to be unloaded by the robot (step 41 or 42 or 43 of the robot).
Finally it goes up (steps 56 and 57) until it reaches its initial median position
again. The process can then start again.

This SFC program is synchronized with the robot by the step variables X33,
X41, X42 and X43. To study it separately, we relax these synchronizations by
replacing “X33” by the variable vX33 and “X41 or X42 or X43” by the variable
vX4. These variables vX33 and vX4 evolve freely.

We build the corresponding timed automaton. It has 1296 locations, 262 896
transitions and 3 clocks. It is too large to be checked by the Kronos tool, which
accepts only automata having fewer than 65000 transitions.

    Step 50
      Transition 50: cap2 and X33         { press loaded in median position }
    Step 51
      Transition 51: 10/X51               { arm cleared from press }
    Step 52   action: S pr_up
      Transition 52: cap3                 { press high }
    Step 53   action: R pr_up
      Transition 53: 2/X53                { blank processed }
    Step 54   action: S pr_down
      Transition 54: cap1                 { press low }
    Step 55   action: R pr_down
      Transition 55: X41 or X42 or X43    { press unloaded }
    Step 56   action: S pr_up
      Transition 56: cap2                 { press in median position }
    Step 57   action: R pr_up
      Transition 57: 1/X57                { press ready for blank }

Fig. 3. SFC program of the press

Moreover, the automaton has locations which represent the press in the high
and low positions simultaneously. In normal running, these are meaningless; they
do not fulfill the constraints of the environment. This is why the construction of
the automaton was then modified so that only the locations satisfying the
constraints of the environment are considered. During our study, we have
encountered three kinds of constraints, according to how they relate to the
locations and/or the transitions:

— Only one of the sensors cap1, cap2, cap3 may be true at any moment because
the press is in a single position. A stronger constraint can be expressed if
inputs c12 and c23 (representing the position between top and medium and
the position between medium and low) are introduced. In this case, one and
only one of the sensors (cap1, c12, cap2, c23, cap3) must be true at a given
moment. The locations which do not satisfy this constraint are removed.

— The changes of values of the sensors are constrained: passing from the low
position to the high position goes through a medium position. The transitions
which do not satisfy this constraint are removed.

— The constraints handling both the locations and the transitions
express the links which exist between the actions and the sensors. Thus when
the action pr_up is performed, the low position is no longer reachable. In the
same way, when the action pr_down proceeds, the high position can no longer be
reached.
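
The three kinds of constraints can be illustrated by a small Python filter over sensor valuations. This is a sketch under assumptions: the sensors are taken to be ordered from low to high as cap1, c12, cap2, c23, cap3, a move is allowed only between neighbouring positions, and the action constraints are read as "no downward move while pr_up is active, no upward move while pr_down is active". All function names are hypothetical.

```python
from itertools import product

# Assumed order of the positions, from low (cap1) to high (cap3).
SENSORS = ("cap1", "c12", "cap2", "c23", "cap3")


def admissible(valuation):
    # First kind: one and only one sensor is true in a location.
    return sum(valuation.values()) == 1


def position(valuation):
    return next(i for i, s in enumerate(SENSORS) if valuation[s])


def admissible_move(src, dst, pr_up=False, pr_down=False):
    if not (admissible(src) and admissible(dst)):
        return False
    delta = position(dst) - position(src)
    if abs(delta) > 1:          # second kind: no jump over a position
        return False
    if pr_up and delta < 0:     # third kind: actions restrict the direction
        return False
    if pr_down and delta > 0:
        return False
    return True


def at(sensor):
    # Valuation in which only `sensor` is true.
    return {s: s == sensor for s in SENSORS}


ALL = [dict(zip(SENSORS, bits)) for bits in product([False, True], repeat=len(SENSORS))]
KEPT = [v for v in ALL if admissible(v)]
```

Of the 32 valuations of the five sensors, only 5 survive the first constraint, which gives an idea of why the automaton shrinks once the environment is taken into account.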
We are interested in two properties of the press:

— The formula expressing that the press should not be moved down if the
sensor cap1 is true is written:
$init \Rightarrow \forall\Box\, \neg$(pr_down and cap1)

— In the same way, to show that the press should not be moved up if the
sensor cap3 is true, it should be shown that the following formula is true:
$init \Rightarrow \forall\Box\, \neg$(pr_up and cap3)

By introducing the inputs c12 and c23 and by considering only the first two
kinds of constraints, the automaton built has 922 locations and 57606 transitions.
On this automaton, the two properties are false.

When the constraints resulting from the actions are added, the automaton
has 314 locations and 13670 transitions.

The first property is still false; this is due to the relaxation of the
synchronizations, which produces an instability. Thus step 55 is not always
activated; it follows that the action “stop to go down” is not always carried
out when cap1 is true.

The second property is true, showing that the environment has been taken
into account sufficiently.

Working on a more realistic representation, we can check more properties.
Taking the environment into account makes it possible to decrease the size of
the automaton but does not solve all the problems of size.

4.2 Reduction of the Size 

During our verifications, we wish to know which situations are reachable and
which values the inputs can take in these situations. The given modeling makes
it possible to answer these questions. It is however possible to consider other
modelings solving this problem. If a Boolean formula could be associated with
a location, the most compact modeling would consist of a timed automaton
reduced to the graph of the situations. As only conjunctions of the propositional
variables can be associated with the locations, we propose a modeling that is
smaller than the initial one but does not reach the graph of the situations.

In a location, we no longer indicate the value of each input but only the
value of the important inputs. For a particular situation, an input is important
if a modification of its value can induce an evolution of the SFC program. In the
timed automaton, a transition is defined only if it corresponds to a modification
of the important inputs or a modification of temporisations.

Once the initial situation is obtained, we determine the important inputs for
this situation. The values of these important inputs are then fixed, and the
initial location is completely defined. For example, for the SFC program of
Figure 4 and the NSS interpretation, the only important input for the step 0 is
a. As a is initially false, the initial location is written (0, a, b, c) where e
means that the value of the input e does not have importance for the current
situation.
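
The notion of important input can be sketched in Python. The sketch approximates "a modification of its value can induce an evolution" by "the input occurs in the receptivity of a validated transition", which is an assumption suited to the NSS interpretation only. The three-step cycle below is hypothetical and is not claimed to be the program of Fig. 4.

```python
def important_inputs(active, transitions):
    """Inputs occurring in the receptivity of a validated transition.

    `transitions` is a list of (preceding steps, input names) pairs.
    """
    found = set()
    for before, variables in transitions:
        if before <= active:      # the transition is validated
            found |= variables
    return found


def location_bound(situations, transitions):
    # Upper bound on the locations when only important inputs are recorded
    # (temporisations are ignored in this sketch).
    return sum(2 ** len(important_inputs(s, transitions)) for s in situations)


# Hypothetical cycle 0 -(a)-> 1 -(b)-> 2 -(c)-> 0
CYCLE = [
    (frozenset({0}), frozenset({"a"})),
    (frozenset({1}), frozenset({"b"})),
    (frozenset({2}), frozenset({"c"})),
]
SITUATIONS = [frozenset({0}), frozenset({1}), frozenset({2})]
```

On this toy cycle, recording every input would allow up to 3 × 2³ = 24 locations, whereas recording only the important input of each situation allows at most 6.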

Fig. 4. SFC program and corresponding timed automaton for the new modeling
in NSS interpretation (the timing constraints are not written)

Then, as long as there remains a state to be treated, we construct the whole
automaton by the following operations: for each temporal event and for each
possible combination of the important inputs of this state, we study the target
situation. If an input is important for the target location, its value must be
fixed. Two cases can occur:

— either it is important in the source location, and its value is then perfectly
defined. This is the case of the input b for the evolution from (1, a, b, c)
corresponding to [b.

— or it is not important in the source location. The target location is then
divided into two sets of locations, one representing the true input, the other
the false input. For example, the virtual location (2, a, b, c) is reachable
from the location (1, a, b, c). This location, where c is an important input, is
divided into two: (2, a, b, c) and (2, d, b, c).

If an input is not important in the target location but is important in the
source location, its value is freed. For example, the input b of the virtual
location (2, d, b, c) is freed, giving (2, d, b, c).

In this modeling, a location represents several locations of the preceding
modeling; in the same way, the number of transitions is reduced. Thus for the
example and the NSS interpretation, the timed automaton has 22 locations and 176
transitions in the first modeling and 7 locations and 19 transitions in the new
modeling. For the SS interpretation, the timed automaton has 16 locations and 112
transitions in the first modeling and 11 locations and 66 transitions in the new
modeling. The reduction is more pronounced for the NSS interpretation than for
the SS interpretation; indeed, the number of important inputs relative to a
situation is smaller in NSS than in SS.

This modeling makes it possible to decrease the size of the timed automata
considerably. On the other hand, it is difficult to take the environment into
account with this modeling. Indeed, as it is not possible to consider Boolean
formulas at the level of the locations, the constrained inputs must be considered
important in all situations. Therefore in the worst case, that is, when all the
inputs are constrained, the new modeling brings no gain.

We have also studied techniques for decreasing the size of the systems to
be checked: composition and abstraction. These techniques are powerful.
However, for timed systems, their study is relatively recent and few results
have been obtained. Their application to the checking of SFC programs is not
immediate and still requires basic work on timed systems.

4.3 Tools 

To facilitate the design and the checking, various tools have been built, such as
an editor of SFC programs (see Figure[0, a simulator, the translators from SFC
programs to timed automata, as well as a verification interface.

The interface makes it possible to choose the parameters of the checking and
to execute the chain of tools which produces the result of the checking.

It was developed in Tcl/Tk and consists of a control panel (see Figure
0. From this panel, the user can choose the various parameters of the verification:

— the SFC program.

— the property to be checked. The properties are expressed in a literal
way and are accessible through a tree structure.

— options. The choice of interpretation (NSS or SS) is possible. We can
moreover specify whether simultaneous modifications of inputs are authorized or
not. The possibility of taking the environment into account is also provided. For
each type of constraint defined in paragraph 4.1, a data-entry window
has been defined.

When the checking is started, the interface takes care of several tasks: 

— construction of the TCTL formula corresponding to the property,

— construction of the timed automaton corresponding to the SFC program,

— call of the verification tool.

On the Figure El the result of the verification of a property on the SFC is 
shown. 

5 Conclusion 

With this work, we show that it is possible to take time into account in the
modeling of SFC programs and to check their qualitative or quantitative temporal
properties.

In modeling with the synchronous languages, we represent discrete time.
During the checking, this modeling leads to a combinatorial explosion of the
number of states, each moment being represented by a state.

We then turned to timed automata. This formalism takes continuous time into
account in its definition, owing to real variables called clocks. With
each step i referred to in a temporisation t1/Xi/t2, we associate a clock. This
clock measures the time since which the step has been active or inactive. On a
timed automaton resulting from this modeling, we can check temporal properties
such as reachability in a minimum or maximum time, or minimum and maximum
durations of activity. For this verification, the delays are no longer a
limitation.

On the other hand, the size of the automata remains a barrier to the checking.
An automaton should not have more than 65000 transitions if Kronos is to
treat it. Unfortunately, some of the automata generated from SFC programs
can have more than 100000 transitions. Solutions to this problem have been
studied. Taking the environment into account at the level of the states and the
transitions enables us to decrease the size of the automata considerably.
In the same way, the proposed new modeling makes it possible to reduce
the number of states and transitions of the generated automata. However, to
increase the size of the SFC programs that can be treated, efforts must continue
in these directions as well as in the study of the verification techniques of
composition and abstraction.

To make SFC programs safer, checking alone does not seem to be
sufficient. In parallel, we think that a methodology should be developed to help
avoid design errors. This methodology could perhaps also support
the building of more easily verifiable SFC programs. It also seems important to
us to confront the models and theories already developed with industrial
applications.

We wish to thank the anonymous referee who helped us to improve our pidgin 
English. 

Abstract. In this paper, we discuss the underlying ideas of a computer
laboratory for symbolic manipulation of discrete random experiments. 
Finite automata, and associated formal series, are the basic theoretical 
tool for representing experiments, and for solving probability problems. 
Starting from a description of a random experiment given as a special 
kind of regular expressions, the environment constructs automata from 
which it extracts generating series associated to the experiment. 



1 Introduction 

The interaction between discrete probabilities and formal language theory has a
long history. Probabilistic tools were used early in information theory cn,
automata theory and coding theory. More recently, evidence of the
usefulness of this interaction has been found in fields like asymptotic analysis
of algorithms |0|, allocation problems, concurrency measures P|, communication
protocols specification and verification, and random generation of combinatorial
structures.

Most of these approaches rely on describing various subsets of A* , the set of 
words on a finite alphabet A, defining probability measures on the elements of 
these subsets, and then deriving formal properties with the usual constructions of 
probability theory. We are interested in developing the underlying computational 
aspects of these methods in order to construct symbolic tools that can be used 
to assist research or education in these fields. 

The main problem can be stated as the formal manipulation of discrete ran- 
dom experiments. This primarily involves describing algebras of experiments and 
linking these to algorithms that can compute adequate probabilistic parameters. 

2 Discrete Random Experiments 

A discrete random experiment is often described by the set $\Omega$ of its
possible outcomes, and by a function

$p : \Omega \to [0, 1]$

assigning a probability $p(e)$ to each event $e \in \Omega$, such that
$\sum_{e \in \Omega} p(e) = 1$.

J.-M. Champarnaud, D. Maurel, D. Ziadi (Eds.): WIA’98, LNCS 1660, pp. 164-|r£_3 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 
For example, consider the experiment of tossing a coin repeatedly until either
the pattern head-head-tail or head-tail-tail shows up [P 74]. Writing h for head
and t for tail, the set $\Omega$ of possible outcomes is:

$\Omega = \{hht, htt, hhht, thht, thtt, \ldots\}$

If we assume that the coin is fair, and that the trials are independent, we
can assign to each sequence $r_1 r_2 \ldots r_n$ in $\Omega$ the probability
$(1/2)^n$.

We are interested in answering the following type of questions about the 
experiment: 

What is the probability that the pattern head-head-tail shows up 
before the pattern head-tail-tail? 

What is the mean length of a sequence? 

Such questions can be answered with classical tools of probability theory, but 
the computations involve infinite series that are often difficult to represent or to 
sum. Our goal is to automate the representation and solutions to such problems, 
at least for some classes of random experiments. (Answers to the above questions 
are given in Section 4.3.) 

2.1 Trials and Experiments 

In the sequel, we will focus on the class of discrete experiments that are inde- 
pendent repetitions of an elementary trial. 

Definition 1 A trial is a simple random experiment whose set of possible out- 
comes is finite. 

For example, tossing a coin is a trial with possible outcomes $A = \{h, t\}$.
To each event in the set of possible outcomes of a trial, we assign a numerical
probability:

$p(h) = 1/3, \quad p(t) = 2/3,$

or a symbolic one:

$p(h) = p, \quad p(t) = 1 - p,$

such that the sum of probabilities is equal to 1.

Definition 2 An experiment is the independent repetition of a trial until the
sequence of results $e = r_1 r_2 \ldots r_n$ exhibits a given property, called
the stopping condition.

By the independence hypothesis, we can assign the probability

$p(e) = p(r_1)\, p(r_2) \cdots p(r_n)$

to an element $e = r_1 r_2 \ldots r_n \in \Omega$, where $p(r_i)$ is the
probability that the elementary trial yields $r_i$.

Note that, in this definition, it is assumed that no proper prefix of
$r_1 r_2 \ldots r_n$ satisfies the stopping condition. For example, in the coin
tossing example, the sequence hhtt does not belong to $\Omega$ since the
experiment would have been stopped after the third throw.

Furthermore, it is not clear if Definition 2 captures the usual notion of ran- 
dom experiment. Indeed, the concept of stopping condition is quite vague. For 
example, a condition could be hard to test, as in “pick up random digits and stop
when the last ten appear consecutively in the decimal expansion of $\pi$”. Even easy
to test but improbable properties such as “toss a fair coin until the number of 
heads equals a hundred times the number of tails” suggest that some experiments 
are not guaranteed to stop. 

In the sequel, we will restrict the possible outcomes of an experiment to finite 
sequences of results, and we will want to have reasonable evidence that one of 
these outcomes will happen. We will formalize these concepts in the next section. 

3 Representing Experiments with Automata 

The problem of describing the set of finite possible outcomes $\Omega$ associated to
an experiment can easily be reformulated as a language theoretic problem: Let
the finite alphabet A be the set of outcomes of a trial, then the set $\Omega$ of finite
possible outcomes of an experiment is a subset of $A^*$.

It is thus natural to turn to automata theory in order to investigate different 
formalisms for describing sets of possible outcomes. Of these many formalisms, 
finite automata are the best suited for symbolic computations. We will see that 
if $\Omega$ can be represented by a finite automaton, then most questions about the
experiment can be symbolically derived from the automaton. 

Consider the example of tossing a coin until either hht or htt appears. The
set $\Omega$ of possible outcomes can be represented by the automaton of Fig. 1 with
initial state 0, and final state 4.

This automaton is a simple mechanism to test membership to the set $\Omega$. As
a generating device, it can also be used to simulate an experiment: in each state,
the result of the elementary trial is randomly chosen according to the given
probabilities, and the automaton goes to its next state by following the
transition corresponding to the result of the trial.
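
The use of the automaton as a generating device can be sketched in a few lines of Python. This is only an illustration, not the laboratory's code: the transition table below is an assumption read off the matrix M given in Section 4.2 (states 0 to 4, state 4 final), and the names TRANSITIONS and simulate are hypothetical.

```python
import random

# Assumed transition table for "until hht or htt appears"
# (state 0 is initial, state 4 is final and has no outgoing transition).
TRANSITIONS = {
    0: {"h": 1, "t": 0},
    1: {"h": 2, "t": 3},
    2: {"h": 2, "t": 4},
    3: {"h": 1, "t": 4},
}
FINAL = 4


def simulate(prob, rng):
    """Run one experiment: draw trial results until the final state is reached."""
    symbols = list(prob)
    weights = [prob[s] for s in symbols]
    state, outcome = 0, []
    while state != FINAL:
        r = rng.choices(symbols, weights=weights)[0]  # one elementary trial
        outcome.append(r)
        state = TRANSITIONS[state][r]
    return "".join(outcome)
```

A call such as simulate({"h": 0.5, "t": 0.5}, random.Random(0)) returns one element of the set of possible outcomes, that is, a word ending with hht or htt in which neither pattern occurs earlier.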

Since our goal is to allow the definition and formal manipulation of ran- 
dom experiments in a computer environment, we must design a formal language 
that allows the construction of complex experiments starting from simple ones. 
Consider, for example, the stopping condition: 

“...until htt or hht appears”. 

It generates the set of possible outcomes: 

$\Omega_1 = \{hht, htt, hhht, thht, thtt, \ldots\}$.

The above condition can be naturally split in two simpler conditions: 

“...until htt appears”, 

“...until hht appears”. 

These, respectively, generate the sets of possible outcomes:

$\Omega_2 = \{htt, hhtt, thtt, hhhtt, hthtt, tthtt, \ldots\}$,

$\Omega_3 = \{hht, hhht, thht, hhhht, hthht, tthht, \ldots\}$.

Since the initial condition is a disjunction, the obvious choice would be to
express $\Omega_1$ as the union of $\Omega_2$ and $\Omega_3$, but it does not work. For
example, the result hhhtt belongs to $\Omega_2 \cup \Omega_3$, but is not in $\Omega_1$
since it has a proper prefix ending with hht.
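
The failure of the union can be checked by brute force on short words. The sketch below is an illustration under stated assumptions: the helper name outcomes is hypothetical, the alphabet is fixed to {h, t}, and words are enumerated only up to a chosen length.

```python
from itertools import product


def outcomes(patterns, max_len):
    """Words over {h, t} of length <= max_len that end with one of the
    patterns and in which no pattern ends before the last letter."""
    result = set()
    for n in range(1, max_len + 1):
        for letters in product("ht", repeat=n):
            w = "".join(letters)
            stops_now = w.endswith(tuple(patterns))
            stopped_earlier = any(p in w[:-1] for p in patterns)
            if stops_now and not stopped_earlier:
                result.add(w)
    return result


omega1 = outcomes(["hht", "htt"], 5)
omega2 = outcomes(["htt"], 5)
omega3 = outcomes(["hht"], 5)
# hhhtt is in the union but not in omega1: the experiment stops at hhht.
counterexamples = (omega2 | omega3) - omega1
```

With this enumeration, omega1 is strictly contained in the union of omega2 and omega3, and hhhtt is one of the words that make the inclusion strict.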

Thus, complex experiments cannot be directly constructed from simpler ones 
using boolean operators. However, in the following section, we will consider a 
class of languages that is closed under a few basic operations, and that will 
provide a suitable frame to express random experiments. 

3.1 Semaphore Codes 

Let A be a finite alphabet, and s be a non-empty word in $A^*$. Consider the
experiment with a simple stopping condition of the form “...until the sequence
s appears”. The set $\Omega$ of possible outcomes of this experiment can be described
as the set of sequences that end with s, but for which no proper prefix ends with
s. This set can be represented by the regular expression:

$\Omega = A^* s - A^* s A^+.$

This expression gives the definition of $\Omega$ in a compact form, and can be
applied to any non-empty word s. Moreover, if s and t are two non-empty words
in $A^*$, the set of sequences that end with either s or t, but for which no proper
prefix ends with s or t, can be represented by the regular expression:

$A^*(s + t) - A^*(s + t)A^+.$

This representation displays nicely where the conjunction or of the stopping
condition effectively occurs, i.e. in (s + t). Sets constructed this way are studied
in coding theory under the name of J-codes, or semaphore codes. We
have the following definition:

Definition 3 Let A be an alphabet, and S be a non-empty subset of A* , then 
the set: 



X = A*S - A*SA+ 

is called a semaphore code. The set S is the set of semaphores of X. 

The most important consequence of this definition is that any formalism used 
to describe rational sets can be used to describe stopping conditions, as long as 
it is understood that only the semaphores are described. Indeed, given a rational 
set S of stopping conditions, the set $A^* S - A^* S A^+$ will also be rational, thus
representable by a finite automaton. 

3.2 Constructing Automata in the Laboratory 

Semaphore codes enable us to construct a simple formalism to describe random 
experiments with stopping conditions. We chose three basic constructions to 
which we associate algorithms that construct the automaton recognizing the 
corresponding set of possible outcomes. 

A basic experiment on an alphabet A is an experiment with a stopping con- 
dition of the form: 



“...until the word s appears” 

where s is a non empty word of A* . The notation for a basic experiment will be 
simply the word s. 

In the preceding section, we saw that this set is recognizable by a finite
automaton. Indeed, constructing the corresponding automaton can be done with a
classical string matching algorithm that uses automata. Given a word
$s = s_1 s_2 \ldots s_n$, we first construct a non-deterministic automaton that
recognizes all sequences that contain s as a factor (Fig. 2).




It can be shown that the corresponding deterministic automaton also has n
states, and only one final state. Since we are interested in the first
occurrence of the pattern, we remove any transition that is defined in
the final state. For example, the stopping condition “...until hht appears” yields
the automaton of Fig. 3.




Fig. 3. Automaton recognizing sequences ending with hht 
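
The construction of the deterministic automaton of a basic experiment can be sketched as follows. This is a minimal illustration, assuming a brute-force search for the longest suffix that is also a prefix of s instead of the classical linear-time failure function; the function name word_automaton and the (delta, initial, final) representation are hypothetical. In this sketch the states are numbered 0 to len(s), the last one being final.

```python
def word_automaton(s, alphabet):
    """Deterministic automaton for the condition "...until the word s appears".

    State k means: the longest suffix of the input read so far that is a
    prefix of s has length k. State len(s) is final and, since only the first
    occurrence matters, it has no outgoing transition.
    """
    n = len(s)
    delta = {}
    for k in range(n):  # no transition is defined in the final state n
        for r in alphabet:
            candidate = s[:k] + r
            j = len(candidate)
            # shrink j until s[:j] is a suffix of candidate (j = 0 always works)
            while not candidate.endswith(s[:j]):
                j -= 1
            delta[(k, r)] = j
    return delta, 0, n


delta, initial, final = word_automaton("hht", "ht")
```

For the word hht this gives four states; reading h in state 2 stays in state 2, and reading t in state 2 leads to the final state 3.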



In order to describe more complex experiments, we define two operators.
The first one is a disjunction, written as $E_1 + E_2$, and expresses the stopping
condition:

“...until either condition $E_1$ or condition $E_2$ is satisfied”

For example, the condition “...until htt or hht appears” will be written as:

hht + htt.

The second operator is used to express sequences of conditions. The
expression $E_1 * E_2$ will correspond to the stopping condition:

“until condition $E_1$ is satisfied then condition $E_2$ is satisfied”

For example, the condition “...until h appears and then t appears”, whose set 
of possible outcomes is: 



{ht, tht, hht, hhht, thht, ttht , . . . } 

would be written as: 



h *t. 

To each expression E constructed with words and the two operators + and
*, we associate a set $\mathcal{L}(E)$. If E is the word s, then $\mathcal{L}(E)$ is the
semaphore set:

$A^* s - A^* s A^+.$

The sets associated to complex expressions are sets recognized by automata
recursively constructed with the following rules, where $\mathcal{A}_1$ is the automaton
associated with $E_1$, and $\mathcal{A}_2$ the automaton associated with $E_2$. Each rule
yields an automaton that has one final state with no outgoing transition.

1) The automaton $\mathcal{A}$ associated to the stopping condition $E_1 + E_2$ is
obtained by computing the Cartesian product of the two automata $\mathcal{A}_1$ and
$\mathcal{A}_2$, and by considering any state $(s_1, s_2)$ of the product as final if
either $s_1$ or $s_2$ is final. The initial state of $\mathcal{A}$ is $(i_1, i_2)$, where
$i_1$ is the initial state of $\mathcal{A}_1$, and $i_2$ is the initial state of
$\mathcal{A}_2$. The transition

$(s_1, s_2) \cdot r$

is defined if and only if $s_1 \cdot r$ and $s_2 \cdot r$ are defined in $\mathcal{A}_1$
and $\mathcal{A}_2$.

If both $\mathcal{A}_1$ and $\mathcal{A}_2$ have only one final state with no outgoing
transition, then any final state $(s_1, s_2)$ of $\mathcal{A}$ will have no outgoing
transition, thus all these states are equivalent and we can identify them.
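
The product rule can be sketched in Python as follows. This is an illustration only: automata are represented as hypothetical triples (delta, initial, final), only the reachable part of the product is built, and the two input automata for hht and htt are written by hand.

```python
# Hand-written automata (delta, initial, final) for "until hht" and "until htt".
A_HHT = ({(0, "h"): 1, (0, "t"): 0, (1, "h"): 2, (1, "t"): 0,
          (2, "h"): 2, (2, "t"): 3}, 0, 3)
A_HTT = ({(0, "h"): 1, (0, "t"): 0, (1, "h"): 1, (1, "t"): 2,
          (2, "h"): 1, (2, "t"): 3}, 0, 3)


def disjunction(a1, a2, alphabet):
    """Reachable product of a1 and a2; every pair with a final component is
    identified with a single final state that has no outgoing transition."""
    (d1, i1, f1), (d2, i2, f2) = a1, a2
    final = "final"
    start = (i1, i2)
    delta, todo, seen = {}, [start], {start}
    while todo:
        s1, s2 = todo.pop()
        for r in alphabet:
            # the transition exists only if it is defined in both automata
            if (s1, r) in d1 and (s2, r) in d2:
                target = (d1[(s1, r)], d2[(s2, r)])
                if target[0] == f1 or target[1] == f2:
                    target = final
                elif target not in seen:
                    seen.add(target)
                    todo.append(target)
                delta[((s1, s2), r)] = target
    return delta, start, final


delta, start, final = disjunction(A_HHT, A_HTT, "ht")
states = {s for (s, _) in delta} | {final}
```

For hht + htt the reachable product has four non-final states plus the merged final state, which agrees with the five states used for this experiment in Section 4.2.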

2) The automaton $\mathcal{A}$ associated to the stopping condition $E_1 * E_2$ is
obtained by gluing the final state of $\mathcal{A}_1$ to the initial state of
$\mathcal{A}_2$, as in Fig. 4. The initial state of $\mathcal{A}$ will be the initial
state of $\mathcal{A}_1$, and its final state will be the final state of $\mathcal{A}_2$.





Fig. 4. Gluing $\mathcal{A}_1$ and $\mathcal{A}_2$



If the final state of $\mathcal{A}_1$ has no outgoing transition, then this
construction yields a deterministic automaton. If the final state of $\mathcal{A}_2$
has no outgoing transition, then $\mathcal{A}$ inherits that property by construction.
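
The gluing construction can be sketched in the same style. The representation (delta, initial, final) and the names sequence and accepts are hypothetical; states are tagged with "A" or "B" so that the two state sets stay disjoint, and the example automata for the one-letter words h and t are written by hand.

```python
A_H = ({(0, "h"): 1, (0, "t"): 0}, 0, 1)  # "until h appears"
A_T = ({(0, "t"): 1, (0, "h"): 0}, 0, 1)  # "until t appears"


def sequence(a1, a2):
    """Glue the final state of a1 onto the initial state of a2."""
    (d1, i1, f1), (d2, i2, f2) = a1, a2

    def left(s):
        # the final state of a1 becomes the initial state of a2
        return ("B", i2) if s == f1 else ("A", s)

    delta = {}
    for (s, r), t in d1.items():
        delta[(left(s), r)] = left(t)
    for (s, r), t in d2.items():
        delta[(("B", s), r)] = ("B", t)
    return delta, left(i1), ("B", f2)


def accepts(automaton, word):
    """True if the word leads from the initial state to the final state."""
    delta, state, final = automaton
    for r in word:
        if (state, r) not in delta:
            return False
        state = delta[(state, r)]
    return state == final


h_then_t = sequence(A_H, A_T)
```

Because the final state of the first automaton has no outgoing transition, no pair (state, letter) receives two targets, so the result is deterministic. The automaton h_then_t accepts ht, tht, hht, hhht, thht and ttht, and rejects htt.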

Thus, any expression describes a set of sequences that can be recognized by a
deterministic automaton. Moreover, all sets constructed this way are semaphore
codes. Indeed, basic experiments of the form “...until the sequence s appears”
are semaphore codes with s as semaphore, and we have:

Proposition 1 If $\mathcal{L}(E_1)$ and $\mathcal{L}(E_2)$ are semaphore codes, then
$\mathcal{L}(E_1 + E_2)$ and $\mathcal{L}(E_1 * E_2)$ are also semaphore codes.

Proof: Let $S_1$ and $S_2$ be the semaphores of $\mathcal{L}(E_1)$ and $\mathcal{L}(E_2)$
respectively. Then the set of semaphores of $\mathcal{L}(E_1 + E_2)$ is simply the
union of $S_1$ and $S_2$. Indeed, if $e \in \mathcal{L}(E_1 + E_2)$ then, by
construction, e is a prefix of words in both $\mathcal{L}(E_1)$ and $\mathcal{L}(E_2)$,
thus e has no proper factor that belongs to $S_1 \cup S_2$. Since e is also a word
of either $\mathcal{L}(E_1)$ or $\mathcal{L}(E_2)$, it ends with a word in $S_1 \cup S_2$.

On the other hand, suppose that $e \in A^*(S_1 + S_2) - A^*(S_1 + S_2)A^+$, and
suppose that e can be written as $e's$ where $s \in S_1$. Then e is a word of
$\mathcal{L}(E_1)$, and since e does not contain any proper factor in $S_2$, it is a
prefix of a word in $\mathcal{L}(E_2)$. Thus e belongs to $\mathcal{L}(E_1 + E_2)$ by
construction.

In order to prove the second part of the proposition, we will show that the
set of semaphores of $\mathcal{L}(E_1 * E_2)$ is $S_1 \cdot \mathcal{L}(E_2)$. If
$e \in \mathcal{L}(E_1 * E_2)$, then e can be written as $e_1 s_1 e_2 s_2$ with
$s_1 \in S_1$ and $s_2 \in S_2$, thus e ends with a word in
$S_1 \cdot \mathcal{L}(E_2)$. Moreover e cannot have a proper prefix of the form
$s'_1 e'_2 s'_2$ since it would imply that $e_1 s_1$ would have $s'_1$ as a proper
factor, or $e_2 s_2$ would have $s'_2$ as a proper factor.

Suppose now that e belongs to the set with semaphores $S_1 \cdot \mathcal{L}(E_2)$.
Then e can be written as $e_1 (s_1 e_2 s_2)$, thus it is a prefix of a word in
$\mathcal{L}(E_1 * E_2)$. Since e has no proper factor in $S_1 \cdot \mathcal{L}(E_2)$,
we have $e \in \mathcal{L}(E_1 * E_2)$. ■



3.3 Examples and Counterexamples 

The following examples show that a vast range of experiments can be defined
with the expressions of Section 3.2. In these examples, we suppose that the
possible outcomes of a trial are A = {a, b, c}, and that n is a natural number.

Example 1. Stop after n trials:

(a + b + c) * (a + b + c) * ... * (a + b + c).

Example 2. Stop after n occurrences of the symbol a:

a * a * ... * a.

Example 3. Stop after each symbol of A appears at least once:

(a * b * c) + (a * c * b) + (b * a * c) + (b * c * a) + (c * a * b) + (c * b * a).

On the other hand, some stopping conditions cannot be represented by such
expressions. The simplest counterexample is to consider a two symbol alphabet
A = {a, b} with the stopping condition “until the number of a is equal to the
number of b”. It is well known in automata theory that the corresponding set
$\Omega$ is not rational.

Finally, even if semaphore codes can express a wide variety of experiments, 
not all experiments are semaphore. Consider, for example, the following set of 
sequences over the alphabet {a, b}: 

{a, baa, bab, bb}

Suppose that one defines an experiment that stops when the sequence of all
trials is equal to one of those four sequences. If we set p(a) = 1/2 and p(b) = 1/2,
then, assuming independence, the sum of the probabilities of the four sequences
in this set is equal to 1. Thus the experiment stops after at most 3 trials but,
since a is a proper factor of baa, this set is not semaphore.

3.4 Does It Stop?

We now turn to the problem of deciding if an experiment will stop or, to put
this more formally: Is the sum of the probabilities of all events in $\Omega$ equal
to 1?

This is clearly not true for most subsets of $A^*$, but one can find many
positive results within code theory. All automata constructed with the notation
of the previous section recognize semaphore codes, and one of the nice aspects of
using semaphore codes is that the corresponding random experiment will always
stop if the probabilities associated to the results of a trial are positive. The
following result is a consequence of a more general theorem that can be found in
[HP 89].

Proposition 2 Let A be a finite set and $p : A \to [0, 1]$ such that $p(r) > 0$ for
all $r \in A$, and $\sum_{r \in A} p(r) = 1$. The function p can be extended
multiplicatively to words in $A^*$.

Let $S \subseteq A^*$ be a non-empty rational set, then if
$\Omega = A^* S - A^* S A^+$ we have

$\sum_{e \in \Omega} p(e) = 1$



Thus, any experiment whose stopping condition is described with the oper- 
ators of Section 3.2 will stop with probability 1. 
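
Proposition 2 can be illustrated numerically by summing p(e) over the words of a semaphore code up to a given length and watching the partial sums approach 1. The sketch below is an assumption-laden illustration: the name stopped_mass is hypothetical, symbols are single characters, S is a finite set of words, and the enumeration is exponential in the length bound.

```python
from fractions import Fraction
from itertools import product


def stopped_mass(prob, semaphores, max_len):
    """Sum of p(e) over the words e of A*S - A*SA+ with |e| <= max_len."""
    total = Fraction(0)
    for n in range(1, max_len + 1):
        for letters in product(prob, repeat=n):
            w = "".join(letters)
            in_a_star_s = w.endswith(tuple(semaphores))
            in_a_star_s_a_plus = any(s in w[:-1] for s in semaphores)
            if in_a_star_s and not in_a_star_s_a_plus:
                p = Fraction(1)
                for r in letters:  # p is extended multiplicatively to words
                    p *= prob[r]
                total += p
    return total


fair = {"h": Fraction(1, 2), "t": Fraction(1, 2)}
partial_sums = [stopped_mass(fair, ["hht", "htt"], n) for n in (3, 6, 12)]
```

For the fair coin and S = {hht, htt}, the partial sum is 1/4 at length 3 and 4047/4096 at length 12; the sequence increases toward 1, as the proposition predicts.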

4 Symbolic Computations 

Once the set of possible outcomes of an experiment is represented by a finite
automaton, it is possible to answer many questions about the experiment
algorithmically. In this section, we give the flavor of these algorithms, whose
details can be found in [B 98].



4.1 Series Associated to an Experiment 

Let’s consider again the coin tossing example. A way to encapsulate the infor- 
mation about the experiment is to write it as an - infinite - formal sum: 

$E(h, t) = \frac{1}{8} hht + \frac{1}{8} htt + \frac{1}{16} hhht + \frac{1}{16} thht + \frac{1}{16} thtt + \frac{1}{32} hhhht + \ldots$

This sum is obtained by formally multiplying a result e of the experiment by
its probability p(e), and summing over all possible values $e \in \Omega$, that is:

$E(h, t) = \sum_{e \in \Omega} p(e) e$

This function can be used to answer questions about some random variables
defined over the experiment. For example, consider the random variable
associating to a result its length, then the generating function F(x) of this
variable is given by:

$F(x) = E(x, x) = \frac{1}{8} x^3 + \frac{1}{8} x^3 + \frac{1}{16} x^4 + \frac{1}{16} x^4 + \frac{1}{16} x^4 + \frac{1}{32} x^5 + \ldots$



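By differentiating this function, and evaluating the result for x = 1, we obtain
the mean length of the experiment as the series:

$F'(1) = \frac{3}{8} + \frac{3}{8} + \frac{4}{16} + \frac{4}{16} + \frac{4}{16} + \frac{5}{32} + \ldots$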






All these observations make little sense if there is no way to get rid of the 
infinite series. As we will show in the next section, for the class of experiments 
that can be described in the laboratory, it is always possible to obtain a closed 
form for the expression: 



$E(h, t) = \sum_{e \in \Omega} p(e) e$

With such a closed form, the computations involved in dealing with random 
variables are substitution and differentiation, both of which can be carried out 
with a symbolic mathematics environment. 



4.2 Computing Closed Forms for Series 

Results from formal series theory tell us that if $\Omega$ is a
rational set, and if p(e) is defined as the product of the probabilities of the
elementary results in A, then the series:

$E = \sum_{e \in \Omega} p(e) e$

is rational, meaning that it can be expressed as the quotient of two polynomials
in the variables $a \in A$.

One way to compute this quotient is first to relabel the automaton
recognizing $\Omega$ by multiplying each result r by its probability p(r). For
example, the automaton of Fig. 1 is relabeled in Fig. 5.

We can represent this new automaton with a matrix M indexed by its states, 
and whose entry (i,j) is the formal sum of the expressions p{r)r for each tran- 
sition between states i and state j. 

Fig. 5. Relabeling an automaton

For example, the matrix corresponding to the automaton of Fig. 5 is the
following:

$M = \begin{pmatrix}
t/2 & h/2 & 0 & 0 & 0 \\
0 & 0 & h/2 & t/2 & 0 \\
0 & 0 & h/2 & 0 & t/2 \\
0 & h/2 & 0 & 0 & t/2 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}$



If we multiply the matrix M by itself k times, then the (i, j) entry of $M^k$
will be the formal sum of all terms obtained by taking the product of labels of
all paths of length k from state i to state j. For example, the entry (0, 2) of
$M^3$ would be:

$\frac{1}{8} thh + \frac{1}{8} hhh$
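
The formal matrix product can be sketched with entries stored as dictionaries that map a word to its coefficient. This is only an illustration: the sparse representation and the name multiply are hypothetical, and the entries of M are those of the matrix of the coin tossing example with fair probabilities.

```python
from fractions import Fraction

HALF = Fraction(1, 2)
# Entry (i, j) of M as a formal sum {word: coefficient}; missing entries are 0.
M = {
    (0, 0): {"t": HALF}, (0, 1): {"h": HALF},
    (1, 2): {"h": HALF}, (1, 3): {"t": HALF},
    (2, 2): {"h": HALF}, (2, 4): {"t": HALF},
    (3, 1): {"h": HALF}, (3, 4): {"t": HALF},
}


def multiply(X, Y):
    """Product of two matrices of formal sums: words are concatenated
    (noncommutatively) and their coefficients are multiplied."""
    Z = {}
    for (i, k), left in X.items():
        for (k2, j), right in Y.items():
            if k != k2:
                continue
            entry = Z.setdefault((i, j), {})
            for u, a in left.items():
                for v, b in right.items():
                    entry[u + v] = entry.get(u + v, 0) + a * b
    return Z


M3 = multiply(multiply(M, M), M)
```

Here M3[(0, 2)] contains the two words thh and hhh, each with coefficient 1/8, and M3[(0, 4)] contains the two shortest outcomes hht and htt.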




Let $M^*$ be the infinite sum $I + M + M^2 + \ldots$. Then the entry (i, j) of
$M^*$ contains the formal sum:

$\sum p(e) e$

where the sum is taken over all paths from state i to state j. Thus the entry
(0, n) of $M^*$, where 0 is the initial state and n is the final state, is exactly:

$E = \sum_{e \in \Omega} p(e) e$

The matrix $M^*$ is also given by $(I - M)^{-1}$ since

$M^*(I - M) = M^* - MM^* = I$



We can thus obtain the value of $M^*$ by inverting the matrix $(I - M)$. This can
be done with the help of a symbolic mathematics environment such as Maple.
In the case of the coin tossing experiment we get:

$E = M^*_{(0,4)} = \frac{1}{2-t}\, h\, \frac{1}{4-th}\, tt + \frac{1}{2-t}\, h\, \frac{1}{4-th}\, h\, \frac{2}{2-h}\, t$

4.3 Random Variables and Other Computations

A linear random variable is the restriction to $\Omega$ of a function
$V : A^* \to \mathbb{N}$ such that V(ef) = V(e) + V(f). Examples of linear random
variables are the length of a word, or the number of occurrences of a symbol in a
word. If V is a linear random variable, then its generating function can be
obtained by replacing each symbol $r \in A$ by $x^{V(r)}$ in the series representing
the experiment. For example, the generating series of the length in the coin
tossing example is given by:

$F(x) = \frac{x^3}{(2-x)(4-x^2)} + \frac{2x^3}{(2-x)(4-x^2)(2-x)}$

thus, the mean length of the experiment is equal to $F'(1) = 5\frac{1}{3}$.

The probability of events that can be described by automata can also be
computed using the same techniques. For example, let X be the event “the
pattern hht shows up first”. Then X can be described by the intersection of the
sets recognized by the automata of Fig. 1 and Fig. 3. Since this intersection is
again rational, we can evaluate:

$\sum_{e \in X} p(e) e = \frac{1}{2-t}\, h\, \frac{1}{4-th}\, h\, \frac{2}{2-h}\, t$

and compute the probability of X by replacing h and t by 1 in this expression,
yielding $\frac{2}{3}$.
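
The two numerical answers of this section can be cross-checked without a symbolic environment by substituting h = t = 1 in M and inverting I - M over the rationals. This is a sketch under stated assumptions, not the laboratory's method: it uses a hand-written Gauss-Jordan inversion (the name inverse is hypothetical), and it obtains the mean length as the sum of the entries of row 0 of the inverse over the non-final states, which is what differentiating the generating function at x = 1 reduces to when the experiment stops with probability 1.

```python
from fractions import Fraction


def inverse(matrix):
    """Gauss-Jordan inversion of a square matrix over the rationals."""
    n = len(matrix)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


H = T = Fraction(1, 2)  # labels h/2 and t/2 with h = t = 1
M = [
    [T, H, 0, 0, 0],
    [0, 0, H, T, 0],
    [0, 0, H, 0, T],
    [0, H, 0, 0, T],
    [0, 0, 0, 0, 0],
]
star = inverse([[int(i == j) - M[i][j] for j in range(5)] for i in range(5)])

total_probability = star[0][4]    # sum of p(e) over all outcomes
mean_length = sum(star[0][:4])    # expected number of trials
prob_hht_first = star[0][2] * T   # outcomes ending with the transition 2 -> 4
```

With these values, total_probability is 1, mean_length is 16/3 and prob_hht_first is 2/3, in agreement with the closed forms above.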

5 Technical Aspects 

The core of the laboratory is written in Java. It handles interactions with the
user, such as the definition of trials and experiments, and occasionally calls
Maple to perform symbolic matrix inversion and differentiation. In order to set
up an experiment, one must first enumerate the results of a trial, and their
associated probabilities:

CoinFlip <- DefineTrial({h, t}; p(h) = 1/2, p(t) = 1/2).

An experiment is defined by a trial and a stopping condition, such as: 

Experiment1 <- DefineExperiment(CoinFlip, htt + hht).

Answering questions about an experiment often involves the definition of a 
random variable. For example, the length can be defined by the call: 

length <- DefineVariable(Experiment1, freq(h) + freq(t))

and one can obtain either its expectation, or probabilities related to this variable 
such as: 



Probability(length = 4)?

The laboratory can also compute the probability of an event defined with
the expressions used for stopping conditions. For example, the probability that
the experiment stops with hht is computed by the call:

Probability(hht)?

Small experiments are handled efficiently. The main computational problem
is matrix inversion. The size of the matrices is equal to the number of states of
the automata describing the possible outcomes of an experiment. The first and
third constructions of Section 3.2 yield reasonably small automata: in the first
case, the number of states is given by the length of the word s, and in the third
case, it is given by the sum of the number of states of $\mathcal{A}_1$ and
$\mathcal{A}_2$. The second construction, associating an automaton to $E_1 + E_2$,
involves a product that can be responsible for a polynomial growth of the number
of states.

The current implementation of the laboratory is quite straightforward.
However, while trying to find the computational limits of this implementation, we
had to construct far-fetched experiments.

For example, consider the coin tossing experiment in which the stopping 
condition is to obtain 60 consecutive tails: 

tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt. 

In 5 minutes, the laboratory gave the expected length of the experiment as: 

2305843009213693950 

Table 1 gives a compilation of the expected length and time of computation 
for the experiments in which the stopping condition is to obtain k consecutive 
tails. 



Table 1. Expected length and computation time for k consecutive tails

 k   Expected length             Time (sec)
 2   6                           6
10   2046                        7
20   2097150                     10
30   2147483646                  22
40   2199023255550               46
50   2251799813685246            104
60   2305843009213693950         274
70   2361183241434822606846      489
80   2417851639229258349412350   1003



Since the stopping condition for these experiments yields a very simple
matrix, we suspected that more complex patterns would tax the resources of Maple.
Indeed, in computing the expected length of experiments in which the stopping
condition is to obtain a sequence of alternating h and t of length k, we had to
stop at k = 50. The results for smaller values are given in Table 2.

Table 2. Expected length and computation time for k alternating h and t

 k   Expected length   Time (sec)
 2   4                 6
10   1364              7
20   1398100           16
30   1431655764        69
40   1466015503700     331
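
The expected lengths reported in Tables 1 and 2 can be cross-checked independently of the laboratory. The sketch below does not use the matrix method of Section 4; it relies instead on the classical autocorrelation identity for a fair coin, which states that the expected waiting time for a pattern is the sum of 2^j over every length j for which the prefix and the suffix of length j of the pattern coincide. The names expected_wait and alternating are hypothetical.

```python
def expected_wait(pattern):
    """Expected number of fair-coin tosses until `pattern` first appears,
    computed from the self-overlaps (autocorrelation) of the pattern."""
    n = len(pattern)
    return sum(2 ** j for j in range(1, n + 1)
               if pattern[:j] == pattern[n - j:])


def alternating(k):
    """Alternating pattern hthth... of length k."""
    return ("ht" * k)[:k]


table1 = {k: expected_wait("t" * k) for k in (2, 10, 20, 60)}
table2 = {k: expected_wait(alternating(k)) for k in (2, 10, 20, 30, 40)}
```

For k consecutive tails every prefix is also a suffix, which gives 2^(k+1) - 2; for an alternating pattern of even length only the even overlaps contribute.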



6 Conclusion 

Our initial goal was to develop tools for computational probabilities. Using 
semaphore codes as a suitable class of languages to describe the possible out- 
comes of an experiment, we were able to construct a computer environment in 
which random experiments can be defined, and queries about probabilities of 
events are automatically answered. 

The current implementation is not optimized. Nevertheless, the scope of
experiments that can be solved is impressive, thanks to Maple. Further
developments include the availability of the laboratory on the Internet, and the
exploration of other classes of languages that could yield interesting
computational results.

Abstract. We show that the concept of automata minimization leads to
a nice interpretation of the famous canonicity of binary decision diagrams
discovered by Bryant.



1 Introduction 

The aim of this paper is to shed light on the links between automaton
minimization and the construction of binary decision diagrams of Boolean
functions. We give some direct applications to complexity. Section 2 recalls basic
definitions regarding the theories of Boolean functions and automata. Section 3
deals with complexity results. Section 4 asks some questions.

2 Definitions and Conventions 

2.1 Boolean Functions 

In this field it is convenient to introduce mathematical structure “just in time” 
so we choose this poor definition: 

Definition 1 A Boolean function on an arbitrary set A is a map from A to the
set of integers B = {0, 1}.

So a Boolean function can be identified with its graph, which is a part of
$A \times \{0, 1\}$. This graph is a set of pairs, called the truth table of the
Boolean function. In computer science we don’t deal with arbitrary sets A but with
sets of n-tuples of bits. The set A is $B^n = \{0, 1\}^n$ for some integer n > 0. We
shall assume that we are dealing with those sets only. We would like to stress
that this is not a light assumption: it means that the set A is a very special set
with plenty of magic properties.

This implies that a Boolean function is now a function of n variables each
ranging over B and with values in B:

$(x_1, \ldots, x_n) \in B^n \mapsto f(x_1, \ldots, x_n) \in B$

We have a canonical order on B induced by that of $\mathbb{N}$: 0 < 1. This order on
B induces lexicographically a total order on A. This permits one to present the
truth table of any Boolean function f as the well known ordered table with $2^n$
rows and 2 columns. The other classical way to present the truth table is that
of a binary ordered tree. Its construction is as follows:

a) The root vertex is labeled with $x_1$.

b) The two sons of any vertex labeled $x_i$ are labeled $x_{i+1}$ if i < n.

c) The two sons of the vertices labeled $x_n$ are leaves.

d) The left (resp. right) edge starting from an interior vertex is labeled with 0
(resp. 1).

e) Each of the $2^n$ leaves is labeled with the value of f on the n-tuple
corresponding to the unique path from the root $x_1$ to this leaf.

We call this tree the canonical tree associated with f and denote it T(f).



2.2 Binary Decision Diagrams 

A binary decision diagram is a member of a family of graphs containing binary
trees like T(f) and all graphs inductively reduced by the following two rules:

(Reduction 1) Identify two isomorphic subtrees G and H by deleting one and
redirecting incoming edges to the root of the other.

(Reduction 2) If a vertex v has exactly one son, suppress v and the two edges
starting from it and connect its incoming edges to that son.

The graphs obtained at each step are always directed, acyclic, and single-rooted;
their branching index is at most two and they have one or two terminal vertices
respectively labeled 0 and 1. This process is confluent and terminates. Applied to
T(f) the result is called the BDD or, more precisely, the ROBDD (reduced ordered
BDD) of f. The complexity of a BDD is defined as the number of its vertices.
Sometimes the complexity is very low and sometimes exponential in the number of
variables. When the complexity of a BDD is low, computers can work with and
manipulate the Boolean function it refers to. This subject is of great interest in
verification and also in reliability theory. We refer the reader to the literature
for excellent accounts and detailed explanations.

Definition 2 The quasi-reduced BDD of a Boolean function is the graph obtained
by iterating reduction steps of type 1 until this reduction cannot be applied
any longer.

We shall speak of the QRBDD of f.
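
The two reductions can be illustrated by counting vertices directly from a truth table. The sketch below is an assumption-based illustration, not an implementation from the paper: a function of n variables is given as a tuple of 2^n bits in lexicographic order, a subtree is identified with the slice of the table it covers, and the names qrbdd_size and robdd_size are hypothetical.

```python
def qrbdd_size(table):
    """Number of vertices after applying Reduction 1 only: at each level,
    isomorphic subtrees are exactly equal slices of the truth table."""
    level = {tuple(table)}
    width = len(table)
    total = 0
    while width > 1:
        total += len(level)
        half = width // 2
        # the two sons of a vertex cover the two halves of its slice
        level = {s[:half] for s in level} | {s[half:] for s in level}
        width = half
    return total + len(level)  # one or two terminal vertices


def robdd_size(table):
    """Number of vertices after applying Reductions 1 and 2."""
    ids = {}

    def node(key):
        return ids.setdefault(key, len(ids))  # Reduction 1: share equal nodes

    def build(sub):
        if len(sub) == 1:
            return node(("leaf", sub[0]))
        half = len(sub) // 2
        low, high = build(sub[:half]), build(sub[half:])
        if low == high:
            return low  # Reduction 2: both edges lead to the same vertex
        return node((len(sub), low, high))

    build(tuple(table))
    return len(ids)


xor_table = (0, 1, 1, 0)    # f(x1, x2) = x1 xor x2
proj_table = (0, 0, 1, 1)   # f(x1, x2) = x1
```

For the exclusive or of two variables both counts are 5. For the projection on the first variable, the QRBDD still has 5 vertices while the ROBDD has only 3, because the two vertices labeled with the second variable are removed by Reduction 2.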

2.3 Automata 

Let A be a finite alphabet. An automaton over A is a 4-tuple
$\mathcal{A} = (Q, I, F, E)$ where Q is a set of states, I is a subset of Q whose
elements are the initial states, F is a subset of Q whose elements are the final
states (or terminal states), E is a subset of the cartesian product
$Q \times A \times Q$ whose elements are the edges. Let $\mathcal{A} = (Q, I, F, E)$
be an automaton. A path of $\mathcal{A}$ is a sequence $(q_i, a_i, q_{i+1})$,
$i = 1, \ldots, n$, for some $n \geq 1$, of consecutive edges. Its label is the
word $w = a_1 a_2 \ldots a_n$. A word $w = a_1 a_2 \ldots a_n$ is recognized by the
automaton $\mathcal{A}$ if there is a path with label w such that $q_1 \in I$ and
$q_{n+1} \in F$. The language recognized by the automaton $\mathcal{A}$ is the set
of words which it recognizes.

We now add some extra structure to T(f) to equip it with the structure of
an automaton.

Definition 3 The truth automaton T(f) = (Q, I, F, E) of f over the alphabet
{0, 1} is derived from T(f) as follows:

1) Q is made of all the vertices of T(f).

2) The unique initial state is the root of T(f).

3) F is the set of all the leaves of T(f) labeled 1.

4) The edges in E are the edges of T(f), with the same labels.

The truth automaton of f can be viewed as the canonical disjunctive normal
form of f. This automaton exactly recognizes the n-tuples of $B^n$ where f
takes the value 1 (called the support of f). In other terms the support of f is
the (regular) language recognized by the automaton T(f). This automaton is
obviously deterministic and accessible, but it is not complete. The completion
of T(f) can be canonically performed by adding a non-final extra state (a sink)
and connecting every leaf (and this sink) to the sink by each of the letters 0 and
1. We then lose the tree structure of the underlying graph. We are now in the
position to minimize this automaton; we shall denote the minimal automaton
by T'(f).

3 Minimal Automata and BDDs 

Theorem 1. The quasi-reduced BDD of f is isomorphic to T'(f).

Proof. This is a consequence of the Nerode equivalence [NE], which can be stated
as follows: two vertices are said to be Nerode equivalent if the sub-automata
rooted in these two vertices recognize the same language.

Suppose two vertices of T(f) are Nerode equivalent. This is exactly equivalent
to the fact that the subtrees rooted at each of these vertices are isomorphic.

The reduction step of type 2 of ROBDDs is proper to Boolean algebra: it
means that if the substitution of $x_i$ by 0 or 1 does not affect the value of a
Boolean function then the function is independent of $x_i$, the variable is “dumb”.
This is not exactly the same for non-Boolean functions. So it’s not surprising
that the reduction of type 2 cannot be captured by the theory of minimization.
The truth automaton T(f) of f being acyclic, the minimal automaton T'(f)
can be computed in time linear in the number of states of T(f).
So the complexity of computing the QRBDD of f starting from T(f) is $O(2^n)$.
Of course, this is quite unrealistic because a Boolean function is never given by
its truth table when the number of variables is large.

We now consider the complexity of ROBDDs and QRBDDs. Akers
says that the worst case ROBDD complexity is $O(2^n/n)$. The result is made
more precise in [LL], where this worst case complexity is proved to be less than
$(2^n/n)(2 + \varepsilon)$ as n grows, where $\varepsilon$ is any arbitrarily small
strictly positive real number. Some further results are given in the literature.

An interesting result, coming from automaton theory, is given by Champarnaud
and Pin. Translated into our context, it says that the maximal number
of states, denoted by g(n), of any QRBDD is exactly

$g(n) = \sum_{1 \leq i \leq n} \min(2^i, 2^{2^{n-i}} - 1)$

and

$\liminf_{n \to \infty} n\, g(n) / 2^n = 1,$

$\limsup_{n \to \infty} n\, g(n) / 2^n = 2.$

For a given n, an explicit construction of all the QRBDDs with a maximal
number of states is also suggested. The construction implies the result of [LL] for
ROBDDs because reduction steps of type 2 can occur only in the tail part of
their automaton (where the number of states decays doubly exponentially). The
reduction of type 2 is, consequently, negligible.

A side consequence is that we can explicitly give a family of very complex 
Boolean functions from the point of view of BDDs. Boolean functions under 
permutations of variables. 

4 Conclusion 

The conclusion is that automata are one essential part of BDD theory and that
this part is quite independent of Boolean algebra. It follows that the same process
(minimization) can be applied to more general BDD-like structures for integer
valued functions, arithmetic functions ... Conversely the BDD theory suggests
good ideas to automata specialists. For example, we don’t touch here the
crucial problem of BDDs, which is the sensitivity of their complexity to the
ordering of variables. This must be reflected in automata theory by the notion of
commutative equivalence of languages. In the same type of idea the process of
identifying isomorphic subtrees can be viewed as a Fourier-Hadamard
transformation (called the spectrum of a Boolean function). This suggests there may
exist a sort of “Fourier transform” under the minimization process for automata.

Abstract. We present here theoretical results coming from the
implementation of the package called AMULT (automata with multiplicities in
several noncommutative variables). We show that classical formulas are
“almost every time” optimal and characterize the dual laws preserving
rationality.



1 Introduction 

Noncommutative formal series (i.e. functions on the free monoid, with values in a
semiring, commutative or not) encode an infinity of data. Rational series can be
represented by linear recurrences, corresponding to automata with multiplicities,
and therefore they can be generated by finite state processes. Literature can
be found on these “weighted automata” and their theoretical and practical
applications (recently one of us solved a conjecture in operator theory using
these tools). The theory was founded by Schützenberger in 1961, where the link
between recognizable and rational series is shown, extending to rings (and to
semirings) Kleene’s result for languages (corresponding to boolean coefficients).
In 1974, for the case of fields, Fliess extended the proof of the equivalence of
minimal linear representations, using Hankel matrices. All these results allow us
to construct an algorithmic processing for these series and their associated
operations. In fact, classical constructions of language theory have multiplicity
analogues which can be used in every domain where linear recurrences between
words are handled.

All these operations can be found in the package over automata with multi- 
plicities (called AMULT). This package is a component of the environment SEA 
(Symbolic Environment for Automata) under development at the University of 
Rouen. 

The paper here is devoted to the study of polynomial compositions of linear 
recurrences. We point out two facts directly linked to implementation. The first 
result says that the classical formulas are “almost everywhere” optimal (this is 
clear from tests at random). The second point shows that three laws known to 
preserve rationality are of the same nature: they arise by dualizing alphabetic 
morphisms. Moreover, they are, up to a deformation, the only ones of this kind, 
which of course, shows immediately in the implemented formulas. 



J.-M. Champarnaud, D. Maurel, D. Ziadi (Eds.): WIA’98, LNCS 1660, pp. 183-^^3 1999. 
(c) Springer-Verlag Berlin Heidelberg 1999

2 Preamble

Let $K\langle\langle A\rangle\rangle$ be the set of noncommutative formal series with $A$ a finite alphabet
and $K$ a semiring. A series denoted $S = \sum_{w \in A^*} \langle S|w\rangle\, w$ is recognizable iff there
exists a row vector $\lambda \in K^{1 \times n}$, a morphism of monoids $\mu : A^* \to K^{n \times n}$ and a
column vector $\gamma \in K^{n \times 1}$ such that for all $w \in A^*$, one has $\langle S|w\rangle = \lambda\mu(w)\gamma$.
Throughout the paper, we will denote by $S : (\lambda, \mu, \gamma)$ this property and say that
$(\lambda, \mu, \gamma)$ is a linear representation of $S$. The integer $n$ is called the rank of the
linear representation $(\lambda, \mu, \gamma)$.

Let $K^{rat}\langle\langle A\rangle\rangle$ be the set of rational noncommutative formal series, that is the
set generated from the letters and the laws $\cdot$ (concatenation or Cauchy product),
$*$ (star operation, partially defined), $\times$ (external product) and $+$ (union or
sum). The following important theorem for series is the analogue of Kleene’s
theorem for languages.

Theorem 1 (Schützenberger, 1961). A formal series is recognizable if and
only if it is rational. 

A reduced linear representation $S : (\lambda, \mu, \gamma)$ is a linear representation of minimal
dimension among all its representations. Existence is guaranteed by definition;
uniqueness is proved in case $K$ is $\mathbb{B}$ or a (commutative or not) field, but is
problematic in general. This minimum is called the rank of the series.
In case $K$ is a field, the rank of $S$ is the dimension of the linear span of the
shifts of $S$ (see Sect. 3.2). It is the smallest number of nodes of an automaton with
behaviour $S$. Here, minimization (up to an equivalence) is possible [14].
An explicit algorithm is given in full detail in [7] (notice that this algorithm
is valid as well for noncommutative multiplicities) as well as the construction of
intertwining matrices.

Notice that the specialisation of $K$ to the boolean semiring $\mathbb{B}$ yields the
case of classical finite state automata.

3 Constructing Usual Laws 

3.1 Operations on Linear Representations 

We expound here universal formulas for constructing linear representations.
They can be applied to any semiring $K$. For two representations of ranks $n$
and $m$, a representation of rank $r(n,m)$ will be provided. Let us recall some
classical facts. Classical operations on series are sum, external product and star
(unary and partially defined). By definition, the sum of two series $R$ and $S$ is

$$R + S = \sum_{w \in A^*} \bigl(\langle R|w\rangle + \langle S|w\rangle\bigr)\, w ,$$

their concatenation (or Cauchy product)

$$R.S = \sum_{w \in A^*} \Bigl(\sum_{uv = w} \langle R|u\rangle \langle S|v\rangle\Bigr)\, w ,$$

and the star of a series $S$

$$S^* = \sum_{n \ge 0} S^n = 1 + SS^*$$

if its constant term is zero (such a series is said to be proper). The preceding
operations have polynomial counterparts in terms of linear representations. We
gather them in the following proposition.



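Proposition 1 Let $R : \rho_r = (\lambda^r, \mu^r, \gamma^r)$ (resp. $S : \rho_s = (\lambda^s, \mu^s, \gamma^s)$) of rank $n$
(resp. $m$). The linear representations of the sum, the concatenation and the star
are respectively

$$R + S : \rho_r \boxplus \rho_s = \left( \begin{pmatrix} \lambda^r & \lambda^s \end{pmatrix},\ \begin{pmatrix} \mu^r(a) & 0 \\ 0 & \mu^s(a) \end{pmatrix}_{a \in A},\ \begin{pmatrix} \gamma^r \\ \gamma^s \end{pmatrix} \right) \qquad (1)$$

$$R.S : \rho_r \boxdot \rho_s = \left( \begin{pmatrix} \lambda^r & 0 \end{pmatrix},\ \begin{pmatrix} \mu^r(a) & \gamma^r\lambda^s\mu^s(a) \\ 0 & \mu^s(a) \end{pmatrix}_{a \in A},\ \begin{pmatrix} \gamma^r\lambda^s\gamma^s \\ \gamma^s \end{pmatrix} \right) \qquad (2)$$

If $\lambda^s\gamma^s = 0$,

$$S^* : \rho_s^{\circledast} = \left( \begin{pmatrix} 0 & 1 \end{pmatrix},\ \begin{pmatrix} \mu^s(a) + \gamma^s\lambda^s\mu^s(a) & 0 \\ \lambda^s\mu^s(a) & 0 \end{pmatrix}_{a \in A},\ \begin{pmatrix} \gamma^s \\ 1 \end{pmatrix} \right) \qquad (3)$$

Remark 1. 1. Formulas (1) and (2) provide associative laws on triplets. They
can be found explicitly in [2].

2. Formula (3) makes sense even when $\lambda^s\gamma^s \ne 0$ (this fact will be used in the
density result of Sect. 3.2).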



3.2 Sharpness 

Here we discuss the sharpness of the preceding constructions. Indeed, testing our
package showed us that the compound linear representation was “almost every time”
minimal when the data were chosen at random. The crucial point in the
proof of Theorem 2 is the fact that certain polynomial indicators are not trivial.
For this, we use suitable examples, which are gathered in the following subsection.



Test Automata Let $B = (S_i)_{1 \le i \le n}$ be a finite sequence of series generating a
stable module and $S = \sum_{i=1}^{n} \lambda_i S_i$. It is well known that the triplet

$$\left( \sum_{i=1}^{n} \lambda_i e_i,\ (\mu(a))_{a \in A},\ \sum_{i=1}^{n} \langle S_i|1\rangle\, e_i^* \right)$$

(where $e_i := (0, \cdots, 1, \cdots, 0)$ with the entry 1 at place $i$, $e_i^*$ the transpose of $e_i$,
and $a^{-1}S_i = \sum_{j} \mu(a)_{ij} S_j$ for any letter $a \in A$) is a linear representation of
$S$.

Here, to each series of one variable, $S = \sum_{p \ge 0} \alpha_p a^p$, of rank $n$, over a field $K$,
we associate the triplet $\tau(S)$ given by $B = (a^{-p}S)_{0 \le p \le n-1}$; of course $S \in K\langle\langle A\rangle\rangle$
provided that $a$ belongs to $A$. This will not affect the rank.

1 

Lemma 1 Let S„ n = ^ — and T„ = be (D-senes. 

’ (1 - aa)" 1 - a" 

1. The rank of $S_{\alpha,n}$, $S_{\alpha,n} + S_{\beta,m}$ ($\alpha \ne \beta$), and $S_{\alpha,n}.S_{\alpha,m}$ are respectively $n$,
$n + m$ and $n + m$.

2. The rank of $T_n$ is $n$ and that of $T_n^*$ is $n + 1$.

Proof. Straightforward. □ 



Density The following theorem proves that if the data are chosen “at random”
in bounded domains, the compound automaton is almost surely minimal. More
precisely:

Theorem 2. Let $A$ be finite and let $\rho_1 = (\lambda_1, \mu_1, \gamma_1)$, $\rho_2 = (\lambda_2, \mu_2, \gamma_2)$ be two linear
representations chosen “at random” within bounded non trivial disks of $K$
($\mathbb{R}$ or $\mathbb{C}$). Then the probability that the linear representation $\rho_1 \boxplus \rho_2$ (resp.
$\rho_1 \boxdot \rho_2$, $\rho_1^{\circledast}$) be minimal is 1.



Proof. The proof rests on the following lemma. 

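Lemma 2 There is a polynomial mapping $P : K^{|A| \times n^2 + 2n} \to K^s$ such that
$P(\lambda, \mu, \gamma) = 0$ iff $(\lambda, \mu, \gamma)$ (a linear representation of degree $n$) is not minimal.

Proof. By a theorem of Schützenberger the representation $(\lambda, \mu, \gamma)$ is minimal
iff $\lambda\mu(K\langle A\rangle) = K^{1 \times n}$ (resp. $\mu(K\langle A\rangle)\gamma = K^{n \times 1}$). As there is a prefix (resp.
suffix) part $U \subset A^*$ (resp. $V \subset A^*$) such that $\lambda\mu(U)$ (resp. $\mu(V)\gamma$) is a basis,
we have $U \subset A^{\le n-1}$ (resp. $V \subset A^{\le n-1}$). Let $A^{\le n-1} = \{w_1, w_2, \cdots, w_m\}$ (denoting
$w_1 := 1$); one constructs the $m \times n$ (resp. $n \times m$) matrix

$$L = \begin{pmatrix} \lambda\mu(w_1) \\ \lambda\mu(w_2) \\ \vdots \\ \lambda\mu(w_m) \end{pmatrix} \quad \bigl(\text{resp. } M = (\mu(w_1)\gamma \ \cdots \ \mu(w_m)\gamma)\bigr);$$

these matrices have polynomial entries in
the data. In view of what precedes, minimality is equivalent to the non nullity
of some $n \times n$ minor of $L$ and of $M$. Sorting these minors as a vector, one gets
the desired polynomial mapping $K^{|A| \times n^2 + 2n} \to K^s$ with $s = 2\binom{m}{n}$. □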

The other steps go as follows. 



1. For the two first operations, let $P_{\boxplus} = P(\rho_1 \boxplus \rho_2)$, $P_{\boxdot} = P(\rho_1 \boxdot \rho_2)$,
and prove that $P_{\boxplus}$ (resp. $P_{\boxdot}$) is not trivial using $\tau(S_{\alpha,n}) = \rho_1$ and
$\tau(S_{\beta,m}) = \rho_2$, $\alpha \ne \beta$ (resp. $\tau(S_{\alpha,n}) = \rho_1$ and $\tau(S_{\alpha,m}) = \rho_2$). For the
star operation, prove that $P_{\circledast} = P(\rho_1^{\circledast})$ is not trivial using $\tau(T_n) = \rho_1$.

2. End of the proof: if $\phi$ is polynomial and not trivial, let $\nu$ be
the normalized uniform probability measure on the product of disks; then the
probability that $\phi \ne 0$ is 1, as $\phi^{-1}\{0\}$ is closed with empty interior.

□
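
The minimality criterion used in the proof of Lemma 2 can be checked mechanically on small examples. The following is a sketch under stated assumptions: coefficients are rationals (Python `Fraction`), the representation is encoded as nested lists, and the helper names `rank` and `is_minimal` are hypothetical. It builds the matrices $L$ and $M$ over all words of length at most $n - 1$ and tests whether both have rank $n$.

```python
from fractions import Fraction
from itertools import product

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def rank(rows):
    """Rank of a matrix by Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    for c in range(len(M[0]) if M else 0):
        pivot = next((i for i in range(rk, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[rk], M[pivot] = M[pivot], M[rk]
        for i in range(rk + 1, len(M)):
            f = M[i][c] / M[rk][c]
            M[i] = [x - f * y for x, y in zip(M[i], M[rk])]
        rk += 1
    return rk

def is_minimal(lam, mu, gam):
    """True iff the rows lam*mu(w) and the columns mu(w)*gam both span dimension n."""
    n = len(gam)
    L, M = [], []
    for k in range(n):                          # all words of length 0 .. n-1
        for w in product(sorted(mu), repeat=k):
            P = [[int(i == j) for j in range(n)] for i in range(n)]
            for a in w:
                P = matmul(P, mu[a])
            L.append(matmul(lam, P)[0])         # one row of L
            M.append([r[0] for r in matmul(P, gam)])  # one column of M, transposed
    return rank(L) == n and rank(M) == n
```

On the block-diagonal sum of $1/(1-a)$ and $1/(1-2a)$ the test succeeds, while on the sum of $1/(1-a)$ with itself it fails, in line with the condition $\alpha \ne \beta$ of Lemma 1.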

4 Dual Laws 

Let $a, b \in A$, $u, v \in A^*$, and $\odot_{\epsilon,q}$ be the law defined recursively by

$$\begin{cases} 1 \odot_{\epsilon,q} 1 = 1, \quad u \odot_{\epsilon,q} 1 = 1 \odot_{\epsilon,q} u = \epsilon^{|u|} u , \\ au \odot_{\epsilon,q} bv = \epsilon\bigl(a(u \odot_{\epsilon,q} bv) + b(au \odot_{\epsilon,q} v)\bigr) + q\,\delta_{a,b}\, a(u \odot_{\epsilon,q} v) \end{cases}$$

with $\delta_{a,b}$ the Kronecker delta.

One immediately checks that this law is associative iff $\epsilon \in \{0, 1\}$. We get the
well-known shuffle ($\sqcup\!\sqcup = \odot_{1,0}$), infiltration ($\uparrow = \odot_{1,1}$) and Hadamard ($\odot = \odot_{0,1}$)
products. Then, $\odot_{1,q}$ is a continuous deformation between shuffle and
infiltration. These laws can be called “dual laws” as they proceed from the same
template that we now describe. We use an implementable realisation of the
lexicographically ordered tensor product. Let us recall that the tensor product
of two spaces $U$ and $V$ with bases $(u_i)_{i \in I}$ and $(v_j)_{j \in J}$ is $U \otimes V$, with the basis
$(u_i \otimes v_j)_{(i,j) \in I \times J}$. For the sake of computation, we impose that the set $I \times J$ be
lexicographically ordered.
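
The recursive law on words can be sketched directly in Python. This is an illustration, not the AMULT code: words are Python strings, polynomials are `Counter` objects mapping words to coefficients, and the name `dual_law` is hypothetical. The boundary case (one argument empty) is assumed to return the other word with coefficient eps to the power of its length, which is the convention that agrees with the coproduct $c_{\epsilon,q}(a) = \epsilon(a \otimes 1 + 1 \otimes a) + q\, a \otimes a$.

```python
from collections import Counter

def dual_law(u, v, eps, q):
    """Return the polynomial u (.)_{eps,q} v as a Counter {word: coefficient}."""
    if not u or not v:
        w = u + v                       # the non-empty argument (or the empty word)
        c = eps ** len(w)               # assumed boundary convention, see lead-in
        return Counter({w: c}) if c else Counter()
    a, b = u[0], v[0]
    out = Counter()

    def push(letter, coef, poly):
        # Add coef * letter * poly to the result, skipping zero terms.
        for w, c in poly.items():
            if coef * c:
                out[letter + w] += coef * c

    push(a, eps, dual_law(u[1:], v, eps, q))      # eps * a (u' . bv')
    push(b, eps, dual_law(u, v[1:], eps, q))      # eps * b (au' . v')
    if a == b:
        push(a, q, dual_law(u[1:], v[1:], eps, q))  # q * delta_{a,b} * a (u' . v')
    return out
```

With `(eps, q) = (1, 0)` this computes the shuffle, with `(1, 1)` the infiltration and with `(0, 1)` the Hadamard product of two words; for example `dual_law('ab', 'c', 1, 0)` lists the three interleavings of ab and c.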

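Let $K\langle A\rangle \otimes K\langle A\rangle$ be the “double” non commutative polynomial algebra, that
is the set of finite sums $P = \sum_{u,v \in A^*} \langle P|u \otimes v\rangle\, u \otimes v$, the product being given by
$(u_1 \otimes v_1)(u_2 \otimes v_2) = u_1u_2 \otimes v_1v_2$. The construction of dual laws is based on the
following pattern.

Let $c_\alpha : K\langle A\rangle \to K\langle A\rangle \otimes K\langle A\rangle$; if for all $u, v \in A^*$, the set $\{w : \langle u \otimes v|c_\alpha(w)\rangle \ne 0\}$
is finite (in which case $c_\alpha$ will be called locally finite), then the sum

$$u \,\square_\alpha\, v = \sum_{w \in A^*} \langle u \otimes v|c_\alpha(w)\rangle\, w$$

exists and defines a (binary) law $\square_\alpha$ on $K\langle A\rangle$, dual to $c_\alpha$. This extends to series
by

$$\langle R \,\square_\alpha\, S|w\rangle := \langle R \otimes S|c_\alpha(w)\rangle .$$

One can show easily that the three laws $\odot$, $\sqcup\!\sqcup$ and $\uparrow$ come from coproducts
defined on the words by

1. $c_\alpha(a_1a_2 \cdots a_n) = c_\alpha(a_1)c_\alpha(a_2) \cdots c_\alpha(a_n)$,

2. $c_\odot(a) = a \otimes a$, $c_{\sqcup\!\sqcup}(a) = a \otimes 1 + 1 \otimes a$, $c_\uparrow(a) = a \otimes 1 + 1 \otimes a + a \otimes a$,

and, generally, $c_{\epsilon,q}(a) = \epsilon(a \otimes 1 + 1 \otimes a) + q\, a \otimes a$. Moreover these coproducts
are the only alphabetic morphisms for which the laws are associative with 1 as
unit. More precisely: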
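
Proposition 2 Let $K$ be a field, and $c_\alpha : K\langle A\rangle \to K\langle A\rangle \otimes K\langle A\rangle$ the alphabetic
morphism defined on the letters of $A$ by

$$c_\alpha(a) = \sum_{p,q \ge 0} \alpha_{p,q}\, a^p \otimes a^q$$

with $c_\alpha(1) = 1 \otimes 1$.

1. The morphism $c_\alpha$ is locally finite iff $\alpha_{0,0} = 0$.

2. Providing $\alpha_{0,0} = 0$, the following assertions are equivalent.

(a) The law $\square_\alpha$ defined by $\langle u \,\square_\alpha\, v|w\rangle := \langle u \otimes v|c_\alpha(w)\rangle$ ($u, v, w \in A^*$) is
associative.

(b) The coefficients $\alpha_{p,q}$ satisfy the relations $\alpha_{p,q} = 0$ for $p$ or $q \ge 2$,
$\alpha_{0,1}, \alpha_{1,0} \in \{0, 1\}$ and $\alpha_{0,1}\alpha_{1,1} = \alpha_{1,0}\alpha_{1,1}$.

3. Under the same hypothesis, the element $1_{A^*}$ is a unit for $\square_\alpha$ iff $\alpha_{0,1} = \alpha_{1,0} = 1$.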

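Proof. 1. We have $c_\alpha(a) = \alpha_{0,0}\, 1 \otimes 1 + \sum_{p+q \ge 1} \alpha_{p,q}\, a^p \otimes a^q$, and then for all $n > 0$,
$c_\alpha(a^n) = \alpha_{0,0}^n\, 1 \otimes 1 + \sum_{p+q \ge 1} \beta_{p,q}\, a^p \otimes a^q$ for some $\beta_{p,q}$. If $\alpha_{0,0}$ were not zero,
the term $1 \otimes 1$ would appear in an infinity of words, and then $c_\alpha$ would not
be locally finite.

Conversely, if $\alpha_{0,0} = 0$, then $c_\alpha(a) = \sum_{p+q \ge 1} \alpha_{p,q}\, a^p \otimes a^q$ and for every word
$w = a_1 \cdots a_n \in A^*$,

$$c_\alpha(w) = \sum_{\substack{p_i + q_i \ge 1 \\ 1 \le i \le n}} \Bigl(\prod_{i=1}^{n} \alpha_{p_i,q_i}\Bigr)\, a_1^{p_1} \cdots a_n^{p_n} \otimes a_1^{q_1} \cdots a_n^{q_n} .$$

As $p_i + q_i \ge 1$, we have $\sum_{i=1}^{n} (p_i + q_i) \ge n$, that is to say $|w| \le |u| + |v|$ and
$Alph(w) = Alph(u) \cup Alph(v)$ with $u := a_1^{p_1} \cdots a_n^{p_n}$ and $v := a_1^{q_1} \cdots a_n^{q_n}$. The
alphabet $Alph(u) \cup Alph(v)$ being finite, the set $\{w : \langle u \otimes v|c_\alpha(w)\rangle \ne 0\}$ is
finite.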

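2. First remark that the two assertions are equivalent to the condition

$$(Id \otimes c_\alpha) \circ c_\alpha = (c_\alpha \otimes Id) \circ c_\alpha \qquad (4)$$

The law $\square_\alpha$ is associative iff for all words $u_1, u_2, u_3 \in A^*$, we have
$$(u_1 \,\square_\alpha\, u_2) \,\square_\alpha\, u_3 = u_1 \,\square_\alpha\, (u_2 \,\square_\alpha\, u_3),$$
that is to say that, for all $w \in A^*$,

$$\langle (u_1 \,\square_\alpha\, u_2) \,\square_\alpha\, u_3|w\rangle = \langle u_1 \,\square_\alpha\, (u_2 \,\square_\alpha\, u_3)|w\rangle .$$

But one has

$$\langle (u_1 \,\square_\alpha\, u_2) \,\square_\alpha\, u_3|w\rangle = \langle (u_1 \,\square_\alpha\, u_2) \otimes u_3|c_\alpha(w)\rangle = \langle u_1 \otimes u_2 \otimes u_3|(c_\alpha \otimes Id) \circ c_\alpha(w)\rangle$$

and

$$\langle u_1 \,\square_\alpha\, (u_2 \,\square_\alpha\, u_3)|w\rangle = \langle u_1 \otimes (u_2 \,\square_\alpha\, u_3)|c_\alpha(w)\rangle = \langle u_1 \otimes u_2 \otimes u_3|(Id \otimes c_\alpha) \circ c_\alpha(w)\rangle .$$

As $u_1, u_2, u_3, w$ are arbitrary, we get $(c_\alpha \otimes Id) \circ c_\alpha = (Id \otimes c_\alpha) \circ c_\alpha$. To
show the equivalence between (2b) and (4), suppose first that (4) holds.
We endow $\mathbb{N}^k$ with the lexicographic order (reading from left to right for
instance) which is compatible with addition and will be denoted $\prec$ (here,
$k = 2, 3$). Then, if it is not zero, $c_\alpha(a)$ can be written

$$c_\alpha(a) = \alpha_{\bar p,\bar q}\, a^{\bar p} \otimes a^{\bar q} + \sum_{(p,q) \prec (\bar p,\bar q)} \alpha_{p,q}\, a^p \otimes a^q ,$$

$(\bar p, \bar q)$ being the highest couple of exponents in the support. Then,

$$(c_\alpha \otimes Id) \circ c_\alpha(a) = \alpha_{\bar p,\bar q}\, c_\alpha(a^{\bar p}) \otimes a^{\bar q} + \sum_{(p,q) \prec (\bar p,\bar q)} \alpha_{p,q}\, c_\alpha(a^p) \otimes a^q$$
$$= \alpha_{\bar p,\bar q}^{\bar p + 1}\, a^{\bar p^2} \otimes a^{\bar p \bar q} \otimes a^{\bar q} + \sum_{(p,q,r) \prec (\bar p^2, \bar p \bar q, \bar q)} \beta_{p,q,r}\, a^p \otimes a^q \otimes a^r ,$$

but

$$(Id \otimes c_\alpha) \circ c_\alpha(a) = \alpha_{\bar p,\bar q}\, a^{\bar p} \otimes c_\alpha(a^{\bar q}) + \sum_{(p,q) \prec (\bar p,\bar q)} \alpha_{p,q}\, a^p \otimes c_\alpha(a^q)$$
$$= \alpha_{\bar p,\bar q}^{\bar q + 1}\, a^{\bar p} \otimes a^{\bar p \bar q} \otimes a^{\bar q^2} + \sum_{(p,q,r) \prec (\bar p, \bar p \bar q, \bar q^2)} \beta'_{p,q,r}\, a^p \otimes a^q \otimes a^r .$$

Necessarily, $\bar p = \bar p^2$ and $\bar q = \bar q^2$, which is only possible when $\bar p \in \{0, 1\}$ and
$\bar q \in \{0, 1\}$, and then $\alpha_{p,q} = 0$ for $p$ or $q \ge 2$. The equality now reads

$$\alpha_{1,0}\, a \otimes 1 \otimes 1 + \alpha_{0,1}^2\, 1 \otimes 1 \otimes a + \alpha_{0,1}\alpha_{1,1}\, a \otimes 1 \otimes a = \alpha_{1,0}^2\, a \otimes 1 \otimes 1 + \alpha_{0,1}\, 1 \otimes 1 \otimes a + \alpha_{1,0}\alpha_{1,1}\, a \otimes 1 \otimes a ,$$

which implies (2b). The converse is a straightforward computation.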

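3. The condition “$1_{A^*}$ is a unit for $\square_\alpha$” implies that, for $a \in A$, we have

$$1 \,\square_\alpha\, a = a \,\square_\alpha\, 1 = a \ \Longrightarrow\ \langle 1 \,\square_\alpha\, a|a\rangle = \langle a \,\square_\alpha\, 1|a\rangle = 1$$
$$\Longrightarrow\ \langle 1 \otimes a|c_\alpha(a)\rangle = \langle a \otimes 1|c_\alpha(a)\rangle = 1$$
$$\Longrightarrow\ \begin{cases} \langle 1 \otimes a|\sum_{p,q \ge 0} \alpha_{p,q}\, a^p \otimes a^q\rangle = 1 \\ \langle a \otimes 1|\sum_{p,q \ge 0} \alpha_{p,q}\, a^p \otimes a^q\rangle = 1 \end{cases}$$
$$\Longrightarrow\ \alpha_{0,1} = \alpha_{1,0} = 1 .$$

Now, this condition implies that, for each $w \in A^*$, $1 \,\square_\alpha\, w = w \,\square_\alpha\, 1 = w$.

□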



Remark 2. For just a commutative law the condition $\alpha_{0,1} = \alpha_{1,0}$ is sufficient.
Moreover, the condition (4) implies $\alpha_{0,1}, \alpha_{1,0} \in \{0, 1\}$.

The preceding computation scheme has an immediate consequence on the im- 
plementation of the laws. 



190 Gerard Duchamp, Marianne Flouret, and Eric Laugerotte 



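Proposition 3 Let $R : (\lambda^r, \mu^r, \gamma^r)$ and $S : (\lambda^s, \mu^s, \gamma^s)$. Then
$$R \,\square_\alpha\, S : (\lambda^r \otimes \lambda^s,\ (\mu^r \otimes \mu^s) \circ c_\alpha,\ \gamma^r \otimes \gamma^s) .$$

Proof. Computation. □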

4.1 Shuffle and Infiltration Product 

Proposition 4 Let $R : (\lambda^r, \mu^r, \gamma^r)$ (resp. $S : (\lambda^s, \mu^s, \gamma^s)$) with rank $n$ (resp.
$m$).

1. Representations of shuffle and infiltration products are respectively

$$R \sqcup\!\sqcup S : (\lambda, \mu, \gamma) = \bigl(\lambda^r \otimes \lambda^s,\ (\mu^r(a) \otimes I_m + I_n \otimes \mu^s(a))_{a \in A},\ \gamma^r \otimes \gamma^s\bigr) ,$$
and

$$R \uparrow S : \bigl(\lambda^r \otimes \lambda^s,\ (\mu^r(a) \otimes I_m + I_n \otimes \mu^s(a) + \mu^r(a) \otimes \mu^s(a))_{a \in A},\ \gamma^r \otimes \gamma^s\bigr) .$$

2. The bound $nm$ is sharp in both cases.

3. The density result of Theorem 2 holds.

Proof. Concerning point 2, an example reaching the bound for any rank is
to consider the families of series $S_n = a^{n-1}$ and $T_n = b^{n-1}$ of rank $n$. The
shuffle product $S_n \sqcup\!\sqcup T_m = a^{n-1} \sqcup\!\sqcup b^{m-1}$ ($a \ne b \in A$) has a minimal linear
representation of rank $nm$. The same example is valid for the infiltration product
as, for $a \ne b$, $a^n \uparrow b^m = a^n \sqcup\!\sqcup b^m$. □
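
The tensor formulas for the dual laws lend themselves to a compact implementation. The sketch below is illustrative only: the encoding of representations as nested lists and the names `kron`, `dual_rep` and `coeff` are assumptions, not the AMULT interface. It builds the matrices $\epsilon(\mu^r(a) \otimes I_m + I_n \otimes \mu^s(a)) + q\, \mu^r(a) \otimes \mu^s(a)$ with a Kronecker product whose index pairs are ordered lexicographically, so that `(eps, q)` equal to `(1, 0)`, `(1, 1)` and `(0, 1)` give shuffle, infiltration and Hadamard product respectively.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def kron(X, Y):
    """Kronecker product; rows and columns follow the lexicographic order on pairs."""
    return [[x * y for x in rx for y in ry] for rx in X for ry in Y]

def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def dual_rep(R, S, eps, q):
    """Representation of rank n*m for the law with parameters (eps, q)."""
    (lr, mr, gr), (ls, ms, gs) = R, S
    In, Im = identity(len(gr)), identity(len(gs))
    mu = {}
    for a in mr:                                   # both alphabets assumed equal
        left, right, both = kron(mr[a], Im), kron(In, ms[a]), kron(mr[a], ms[a])
        mu[a] = [[eps * (x + y) + q * z for x, y, z in zip(r1, r2, r3)]
                 for r1, r2, r3 in zip(left, right, both)]
    return kron(lr, ls), mu, kron(gr, gs)

def coeff(rep, word):
    """Coefficient of `word`, computed as lam * mu(word) * gam."""
    lam, mu, gam = rep
    row = lam
    for a in word:
        row = matmul(row, mu[a])
    return matmul(row, gam)[0][0]
```

As a usage example, taking the rank-2 representations of the single words a and b, `dual_rep(..., 1, 0)` yields a rank-4 representation whose only non-zero coefficients are those of ab and ba, as expected for the shuffle of two distinct letters.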

4.2 Hadamard Product 

We recall that the Hadamard product ($\odot$) of two series is the pointwise
product of the corresponding functions (on words). We can use the machinery
above to describe a representation of it.

Proposition 5 Let $R : (\lambda^r, \mu^r, \gamma^r)$ (resp. $S : (\lambda^s, \mu^s, \gamma^s)$) with rank $n$ (resp.
$m$). A representation of the Hadamard product is

$$R \odot S : \bigl(\lambda^r \otimes \lambda^s,\ (\mu^r(a) \otimes \mu^s(a))_{a \in A},\ \gamma^r \otimes \gamma^s\bigr) ,$$

and the bound is asymptotically sharp.

Proof. Let $\beta(n, m) := \sup_{rank(R) = n,\ rank(S) = m} rank(R \odot S)$. We claim that

$$\limsup_{n,m \to +\infty} \frac{\beta(n,m)}{nm} = 1$$

(what we mean by “asymptotically sharp”).

Indeed, let us consider the Hadamard product of two series of the family

$$S_n = \sum_{k \ge 0} a^{nk} = \frac{1}{1 - a^n} .$$

The rank of $S_n$ is $n$, and

$$S_n \odot S_m = \sum_{k \ge 0} a^{nk} \odot \sum_{k' \ge 0} a^{mk'} = \sum_{k \ge 0} a^{lcm(n,m)k} = S_{lcm(n,m)} .$$

Thus, for $n$ and $m$ coprime, the rank of the product is $nm$, which proves the
claim. □

Abstract. We present biba, a package designed to deal with representations
of large automata. It offers a library able to build, even on a
modest computer, automata where the sum of the numbers of states and
edges reaches one billion or more. Two applications that use this library
are provided as examples. They build the reduced automaton for a given
vocabulary, and the suffix automaton of a given word. New programs can
be developed using this library. In order to overcome physical memory
limitations, biba implements a paging scheme, in such a way that the
automata really reside on disk, making their permanent storage possible.
Through a simple interface suited for perl, small scripts can easily be
written to use these automata and extract information from them.



1 Introduction 

Large and complex automaton-based data structures may arise when dealing 
with massive data sets. These automata can be viewed as ways to extract in- 
formation from those sets, as well as ways to put and maintain them in a form 
suitable for specific applications, like search engines or spelling checkers. In this 
second case, the automaton is not a temporary need, but one definitive format 
under which the data set (dictionary) will be maintained. 

Originally developed to apply statistical tests to biological sequences
[10], biba is mainly a C programming library, but it also contains programs built
on top of this library and offered as applications. It performs user-level memory
paging, in the same sense that this term is used in operating systems textbooks
(see for instance Bach [1]). However, biba’s paging implementation is not only
a way to overcome physical memory limitations; it is also, and mainly, an
efficient way to permanently store automata on disk.

Paging is a very interesting option in those cases where the automaton
is comparable to a large relational database that resides on disk but is
continuously consulted and occasionally changed over a long period of time. An
example we can think of is an automaton-based index for a large dynamic
textual database, as glimpse [7] does.

In this paper we present the main features and characteristics of biba. A
complete technical description may be found in the man pages and the source
code, freely available at the URL given at the end of the paper. It was tested by
the author on x86 and axp Linux, but the library will compile and run with
few or no changes on almost any 32 or 64-bit platform for which a C compiler is
available.



Paging Automata 193 



2 An Overview from the User’s Viewpoint 

We introduce the features of biba through some examples. For now, the visible
part of biba consists of just two programs included in the package, reduce and
suffaut.

The reduced automaton that recognizes a set of words, given one per line in
a file, can be built using the reduce tool, which implements the algorithm due to
Revuz [8]:

$ reduce words 

The automaton just built resides in the pair of files a.aut.st (the states) and
a.aut.ed (the edges). These are ordinary files that you can copy, compress or
transfer. Because of practical limitations that the user may need to deal with,
the files that contain an automaton can be split into many parts, so one can
combine the available areas of many filesystems and take advantage of using
disks in parallel. For instance, the following command will divide a.aut.ed in
two parts, the first of them containing just 25 megabytes:

$ biba -s 25 a.aut.ed a.aut-l.ed 

Next we present suffaut, a program that builds the suffix automaton of a 
given word, using the algorithm due to Blumer et alii [2] . In this same example 
we show how one paging parameter (among others) can be controlled through 
options. In this case we’re specifying that the maximum ratio between total 
virtual memory size and total physical (RAM) memory size is 5. Just like in the 
first example, the automaton will reside in the files a.aut.st and a.aut.ed. We 
chose a binary word (the file /bin/sh seen as a sequence of bytes): 

$ suffaut -p 5 -f /bin/sh 

word length: 300668 

states: 428275 (14 finals) edges: 636565 

ram pages: 1300 virtual pages: 2141 



This command took 64 seconds to complete on a 100 MHz 486 with 32 megabytes
of main memory. By default suffaut allocates 5 megabytes of RAM before
allowing the ratio between virtual and physical memory to become larger than 1.
As the page size is 4096 bytes, suffaut allocated 1300 RAM pages. The final sizes of
the files a.aut.st and a.aut.ed were 5.5 and 3.2 megabytes, so suffaut allocated
13 bytes per state and 5 bytes per edge. These numbers will be explained in the
next section (note that the algorithm requires two additional 32-bit fields per
state).

Now we’ll combine these two programs in order to test the accuracy of both.
Let’s consider Good-de Bruijn words over the alphabet {0, 1}. The script good
generates such a word with parameter p, that is, a word w where each element
of $\{0, 1\}^p$ occurs just once as a factor of w. The number of states of the suffix
automaton of w can be theoretically computed as being $2^{p+1} - 1$ [10]. We can
build this automaton using suffaut or reduce, and compare the results:

$ good 7 | suffaut

word length: 134

states: 255 (8 finals) edges: 381

$ good -s 7 | reduce

states: 255 (8 finals) edges: 381



In order to generate the word, good uses the classical construction based on 
Eulerian Tours. When called with the -s switch, it generates the word and all 
its suffixes. This script is available with the biba distribution. 
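
The state count $2^{p+1} - 1$ can also be reproduced independently of biba. The sketch below is not the good script nor suffaut: it generates a binary de Bruijn word with the greedy “prefer-one” rule (a different technique from the Eulerian tour construction used by good) and builds the suffix automaton with the usual online algorithm with suffix links and state cloning. The function names are hypothetical.

```python
def de_bruijn_word(p):
    """Linear binary de Bruijn word of order p, built by the greedy prefer-one rule."""
    word = '0' * p
    seen = {word}
    while True:
        for bit in '10':                               # try 1 first, then 0
            window = word[len(word) - p + 1:] + bit    # candidate factor of length p
            if window not in seen:
                seen.add(window)
                word += bit
                break
        else:
            return word                                # neither extension is new

def suffix_automaton(word):
    """Online construction; returns (transitions, suffix links, last state)."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in word:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:                                      # split q by cloning it
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    return nxt, link, last

def count_finals(link, last):
    """Final states are those on the suffix-link path from the last state."""
    n, s = 0, last
    while s != -1:
        n, s = n + 1, link[s]
    return n
```

For p = 7 the word has length 134, and the numbers of states, edges and final states can be compared with the figures printed by suffaut above.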

In the next two examples biba deals with natural language dictionaries and
biological vocabularies. In both cases we’re building reduced automata, in the
first case for the entire br.ispell dictionary for Brazilian Portuguese [11] (as
done by Lucchesi and Kowaltowski in [5]), which contains 195539 entries
(total size 2120562), and, in the second, for the blocks protein database [4]
(109660 entries, 3249189 total size). The filter eb, which extracts the sequences
from the database distribution file and removes comments, is provided in the biba
distribution.

$ ispell -e vocab -d br | reduce 

states: 14119 edges: 36588 

ram pages: 1300 virtual pages: 1303 



$ eb | reduce -s 200000 -m 28 -c 28 -d 1000000
states: 1318407, edges: 1401826 
ram pages: 6200 virtual pages: 6182 



The first command took just 240 seconds to complete on the same 486 computer
described above. The second took 27 minutes. Note that this time may vary
widely depending on the choice of the paging parameters. We omit the
final sizes of the files a.aut.st and a.aut.ed because in these cases they are full
of garbage. In fact, their sizes are not determined by the automaton size but
by the vocabulary size, because the algorithm first builds a trie (this behaviour
may be partially avoided using a command-line option). In any case, at the end of
the construction, each state uses 9 bytes, and each edge uses 5 bytes.
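
The two-phase behaviour mentioned above (a trie first, then reduction) can be illustrated with a small in-memory sketch. It is written in the spirit of bottom-up minimization of acyclic automata, not as a transcription of Revuz's algorithm or of the reduce tool: states with the same final flag and the same outgoing edges to already-merged states are identified through a dictionary of signatures. The name `reduce_words` is hypothetical, and the recursion assumes reasonably short words.

```python
def reduce_words(words):
    """Return (states, edges) of the reduced automaton recognizing `words`."""
    # Phase 1: build the trie; trie[s] maps a letter to the child state.
    trie, final = [{}], [False]
    for w in words:
        s = 0
        for ch in w:
            if ch not in trie[s]:
                trie[s][ch] = len(trie)
                trie.append({})
                final.append(False)
            s = trie[s][ch]
        final[s] = True

    # Phase 2: merge equivalent states bottom-up using their signatures.
    registry = {}          # signature -> identifier of the merged state

    def merged(s):
        sig = (final[s],
               tuple(sorted((ch, merged(t)) for ch, t in trie[s].items())))
        return registry.setdefault(sig, len(registry))

    merged(0)
    states = len(registry)
    edges = sum(len(sig[1]) for sig in registry)
    return states, edges
```

The trie for tap, taps, top and tops has 8 states, while the reduced automaton returned by `reduce_words` has only 5, which shows why intermediate files sized by the trie can be much larger than the final automaton.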

The library can also be used for many other purposes, but developing and linking
programs with the biba library is a time-consuming exercise in C programming,
so we offer a simple alternative that partially opens the library’s internal
functionality to scripts that can be quickly written using perl or similar languages.

One trivial script is available in the distribution. It reads the file specified
as its first parameter line by line, checks whether each line is spelled by a given
automaton, and prints it if so. Hence, this script can be used
as a filter. A more elaborate example is also contained in the biba distribution.
It computes the complexity of a biological sequence as defined by Trifonov in [9].

3 Short API Description

For the programmer we provide a short description of the C API (for a detailed
description please refer to the man pages). Just as automata are like files
to the user, they are like files to the programmer too. So dealing with
automata using this API is like dealing with files through the C language library.
Automata handles are provided, and they can be used to create or specify an
existing automaton, and to subsequently work on it.

create_aut     similar to creat(2)
register_aut   similar to open(2)
close_aut      similar to close(2)

In order to deal with automata states and edges, very simple services are
provided. Each one requires the specification of a handle, amongst other specific
parameters. The complete list follows:

new_state      allocate a new state
set_final      set (to a given value) the state final flag
is_final       read the state final flag
set_dest       set the destination of an edge
dest           read the destination of an edge
stdeg          read the output degree of a state
next_dest      visit all edges leaving a given state
clone          clone a state

For low level configuration and session recovery support, the following
services are provided:

init_aut       initialize the API and specify paging parameters
stop_pgout     disallow page outs, in order to keep
flush_aut      synchronize the disk files with the contents of RAM



4 Internal Representation of Automata 

States and edges reside in separate data structures, which grow as needed
(currently they cannot shrink). The main goal of these data structures is to save
memory, because we want to be able to represent large automata, so we
try to minimize paging at the expense of a larger cost of in-memory operations.

The data structure that represents states is nothing more than a large array.
Each entry of this array is a “state”, composed of fixed fields (its output degree, a
flag to identify if the state is final, and a pointer to the structure that represents
its edges) and application-dependent fields, like flags and counters.

The data structure that represents the edges is an array too. It limits the
alphabet size to 256, and currently does not support nondeterministic automata.
In order to represent the edges that leave a state with output degree d, where
d is in the range 0..256, it allocates just d contiguous entries of the array of
edges. The allocation routines guarantee that these d entries reside on the same
page, a fundamental property, needed to avoid performance degradation. Each
edge is composed of a label field and a destination field which points to a state.
Application-specific fields are not allowed.

Allocating a state amounts to increasing by 1 the current size of the array
of states and managing the necessary changes in the data structure, which may
include the allocation of a new page. Inserting an edge leaving a state demands
the allocation of d contiguous entries of the array of edges (best-fit is used),
where d is the resulting output degree. This is managed by maintaining
lists of free areas within this array, one for each possible free area size.

Both arrays reside on disk, but at each moment only some of their pages are
loaded into memory, according to the paging routines. Each state and each edge
uses just 40 bits (the size of a state may be larger depending on
the presence of application-specific fields). This number is due to the current limit
on the alphabet size, which cannot be larger than 256, plus the choice of a 31-bit
addressing limit for states and edges. So if n is the sum of the number of states
and edges of an automaton, then the total disk size needed to represent it will
be 5n bytes.
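
To make the 40-bit figure concrete, here is a small sketch of how an edge could be packed into 5 bytes and how the file sizes reported earlier follow from the per-record sizes. The byte layout (one label byte followed by a 32-bit big-endian destination whose top bit is unused) is purely an assumption for illustration; the actual biba layout is described in its man pages and source code. The function names are hypothetical.

```python
import struct

EDGE_FORMAT = '>BI'   # assumed layout: 8-bit label, then 32-bit destination (31 bits used)

def pack_edge(label, dest):
    """Pack an edge into 5 bytes under the assumed layout."""
    if not 0 <= label < 256:
        raise ValueError('label must fit in one byte (alphabet size at most 256)')
    if not 0 <= dest < 2 ** 31:
        raise ValueError('destination must fit in 31 bits')
    return struct.pack(EDGE_FORMAT, label, dest)

def unpack_edge(record):
    """Inverse of pack_edge: returns (label, dest)."""
    return struct.unpack(EDGE_FORMAT, record)

def file_sizes(states, edges, extra_state_bytes=0):
    """Sizes in bytes of the state file and the edge file.

    extra_state_bytes counts application-specific fields attached to each state.
    """
    return states * (5 + extra_state_bytes), edges * 5
```

With the suffaut run on /bin/sh (428275 states, 636565 edges, two additional 32-bit fields per state, hence 13 bytes per state), `file_sizes(428275, 636565, 8)` gives 5567575 and 3182825 bytes, in line with the reported 5.5 and 3.2 megabytes.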

Page replacement proceeds by swapping out pages based on their ages. The
pages are visited circularly during program execution, just as a memory
refresh circuit does. The age of a page is the amount of time elapsed since the
last time it was accessed. By comparing the age of a page with the average age
computed over all pages, one can decide if it will be swapped out or not.
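
The age-based policy can be pictured with a toy simulation. This is only a sketch of the idea, not biba's paging code: time is a simple access counter, the class and method names are hypothetical, and the rule “evict the first page, in circular order, whose age is at least the average age” is one possible reading of the description above.

```python
class PageTable:
    """Toy model of age-based page replacement with a circular hand."""

    def __init__(self, capacity):
        self.capacity = capacity      # number of pages that fit in RAM
        self.last_access = {}         # resident page -> time of last access
        self.clock = 0                # logical time: one tick per access
        self.hand = 0                 # position of the circular scan

    def touch(self, page):
        """Access a page; return the page swapped out, if any."""
        self.clock += 1
        evicted = None
        if page not in self.last_access and len(self.last_access) >= self.capacity:
            evicted = self._evict()
        self.last_access[page] = self.clock
        return evicted

    def _evict(self):
        pages = list(self.last_access)
        ages = {p: self.clock - t for p, t in self.last_access.items()}
        average = sum(ages.values()) / len(pages)
        while True:                   # terminates: some page is at least average
            page = pages[self.hand % len(pages)]
            self.hand += 1
            if ages[page] >= average:
                del self.last_access[page]
                return page
```

With a capacity of two pages, accessing A, B, A and then C swaps out B, since B is older than the average age while A was touched recently.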



5 Limits and Limitations 

The limits imposed by addressing strategies or performance requirements follow.
We plan to relax some of them in future releases of the library (see the section
Future Directions).

— The states of one automaton can use a virtual space of at most 8 gigabytes.
The same limit applies to its edges (this is imposed by a page addressing
limit).

— Paging makes it infeasible to adopt a platform-independent representation of
data, so when moving automata from one computer to another, conversion
tools may be needed (such tools are not currently available).

— To avoid performance degradation, it is not currently allowed to attach
application-specific fields to edges nor to use alphabets with sizes larger
than 256.

— Nondeterministic automata are not currently supported.

— Splitting an automaton into many files is currently limited to 10 parts for the
states and 10 parts for the edges. So an automaton may be split into at most 20 files.

6 Notes on Performance

The paging scheme by itself represents a performance penalty, because of the
additional memory and time consumption needed to manage and detect page
faults and access memory. In a user-level paging scheme like the one implemented
in biba, this penalty is more significant because we cannot use, as the operating
system kernel does, hardware features that make paging more efficient (and
unportable).

Estimating the time penalty requires more programming effort. We will be able
to do so using the suffix array tool under development (details in the next
section).

On the other hand, regarding memory usage, paging is not critical. The
tables and additional information needed by biba to manage paging represent
less than 1.5% of the amount of physical memory allocated (details on how this
percentage was computed can be found in the file mem.c of the distribution).

7 Future Directions 

Regarding the representation of edges, some enhancements are needed, like support
for nondeterministic automata. Changes are being planned in order to allow
the addition of application-specific fields to edges, as well as the use of larger
alphabets and/or support for automata where the labels of the edges are words over
a given alphabet.

A search tool for WWW servers similar to glimpse [7] will be released soon.
It currently uses an index based on GNU gdbm, but an automaton-based
alternative is being added.

More general data structures besides automata can be represented using
biba. We plan to release soon a program that builds the suffix array and
implements both the refinement algorithm due to Manber and Myers [6] and the
external sorting algorithm due to Gonnet [3]. We wrote this program some time
ago, and it is currently being adapted to use the biba paging scheme.

Full access to the internal library capabilities through perl scripts is being
worked on. We currently do not have a schedule for other automata or graph
algorithms to be implemented, but we’re accepting suggestions.

9 Availability

The sources are available under the terms of the GNU GPL, and can be copied
from ftp://ftp.ime.usp.br/pub/ueda/biba. General, programming and last-minute
information is available at http://www.ime.usp.br/~ueda/biba.

Abstract. In the present paper we consider the abstract background
for designing a practical graph-based computational environment with
variable, optional semantics. From the variety of possibilities we
concentrate on graphs and polynets as possible carriers of the syntax, and finite
automata and flow-diagram programs as possible semantics. We discuss
the encapsulation property which emerges in such systems and give a
precise description of the syntax and of the operational and denotational
semantics in terms of Category Theory. A data structure capable of meeting the
requirements of a graph-based computational environment is sketched at the
end.



Graph-based representations are a common vehicle for describing processes or
events in contemporary science. In particular, they can be used to describe
computational processes. Finite automata were possibly the first attempt in this
direction. Flow-diagram descriptions of algorithms are another typical example,
as old as programming itself. Finite automata and flow-diagrams
attracted the attention of scientists because of their simplicity and
transparency, and were the subject of active theoretical research. Nevertheless,
this research did not result in a practical graphical computational environment,
because there was no suitable medium that could comfortably accommodate
anything beyond the elementary cases. With contemporary
technology and its ability to manipulate visual information in real time, it
is now feasible to implement such an environment and graph-based
computational languages. Computational languages of this type fall into the
class of visual programming languages. In the present paper we analyze the
theoretical background of their possible syntax, semantics and implementation.

The subject of the following exposition is an environment for geometrical,
interactive manipulation of graphs which, under appropriate restrictions
on their form, can carry a variety of optional semantics and carry out the
computations prescribed by the currently selected one.

We will restrict our attention to the two most natural candidates for possible
semantics: finite automata and flow-diagram programs. Finite automata in
all their different forms are a well-developed concept naturally connected with
graphs. On the other hand, flow-diagram programs are the most transparent
representation of the structure of algorithms. In a sense, flow-diagram programs are
another form of representation of finite automata, more appropriate for the task
of programming, where the permanent tracking of the global state of the system
is an unnecessary burden.



1 Alternatives and Their Formal Background 

Any particular program or automaton in the considered environment has two
sides: an external appearance, which is a geometrical picture on an appropriate
medium (say the display of a computer), and its internal representation. The
link between these sides is an abstract structure similar to graphs. Both the
geometrical and the internal representation in any particular case have as a core
an isomorphic image of an instance of such a structure. Therefore, an abstract
mathematical structure underlies any representation. In this section we will
briefly discuss several candidates which can be used as a basis. Ordinary
directed graphs are the most natural candidate and a good starting point for
our analysis.
Fig. 1.



We consider a directed graph as a tuple

$G = (N, A, \mathit{sor}, \mathit{tar})$

consisting of a set of nodes $Nd(G) = N$, a set of arrows $Ar(G) = A$, and two
functions $\mathit{sor}, \mathit{tar} : A \to N$ which assign to each arrow its source
and target nodes respectively. This understanding of graphs has the advantage
that it permits several arrows in the same direction between two
nodes. It can also be directly embedded in the framework of Category Theory for
resolving the semantic problems. A finite graph of appropriately restricted form
can be considered as a representation of a computational process. The imposed
restrictions depend on the intended semantics.
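As a minimal sketch of this definition (the class and method names are illustrative assumptions, not part of the paper), the tuple (N, A, sor, tar) can be held as a node set plus two dictionaries keyed by arrow name. Because arrows are first-class and carry their own names, two arrows in the same direction between the same pair of nodes cause no difficulty.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Directed graph G = (N, A, sor, tar); parallel arrows are allowed."""
    nodes: set = field(default_factory=set)
    sor: dict = field(default_factory=dict)   # arrow name -> source node
    tar: dict = field(default_factory=dict)   # arrow name -> target node

    def add_arrow(self, name, source, target):
        # Hypothetical helper: registers the arrow and both of its end nodes.
        self.nodes.update((source, target))
        self.sor[name] = source
        self.tar[name] = target

    @property
    def arrows(self):
        return set(self.sor)

    def outgoing(self, node):
        """Names of the arrows whose source is `node`, in sorted order."""
        return sorted(a for a in self.sor if self.sor[a] == node)


g = Graph()
g.add_arrow("a", 1, 2)
g.add_arrow("b", 1, 2)   # a second arrow in the same direction between 1 and 2
g.add_arrow("c", 2, 1)
```

A representation keyed by node pairs (an adjacency matrix, for instance) would not distinguish the arrows "a" and "b" above, which is why the arrow-keyed form is used in this sketch.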

If a graph is to be interpreted as a deterministic finite automaton of Mealy
type, the nodes must be labeled with states and the arrows with pairs of input/output
signals (see Fig. 1, considering a, b, c and d as names of such pairs). Furthermore,
appropriate rules for the flow of control must be imposed. Additional
restrictions may require that the arrows outgoing from a node be labeled
with different input signals (determinism) and that there be enough outgoing
arrows for each node to cover each input signal (totality). The semantic goals of an
interpreted graph can be described at an abstract level using an appropriate category
associated with it.

If a graph is to be interpreted as a flow-diagram, according to the traditional 
flow-diagram approach, the nodes must represent computational steps and the 
arrows must represent transitions from one computational step to another i.e. 
the control. Typically, the nodes of the graph are divided into two groups: nodes 
for actions and nodes for decisions. Only one arrow can originate at an action 
node and usually only two arrows can originate at a decision node. A node is 
specified as the beginning of the prescribed process and several nodes can be 
specified as its ends. A general restriction on the form of a flow-diagram (if only 
deterministic processes are of interest) requires that any two paths from one 
node to another must branch through a common decision node. 

An alternative to the traditional flow-diagrams, advocated by
Landin, Burstall and Goguen [3], is to consider the arrows as computational
processes and the nodes as data containers (Fig. 1). Except for its finiteness, no
other restrictions on the form of the graph are necessary for it to represent a
computational process. Appropriate restrictions are imposed on a semantic level
instead. Arrows with the same source are considered as alternative. Alternative
arrows are used to represent decision steps and have a special interpretation. We
will give preference to this approach not only because of its connection with Category
Theory but also because any computation, working on data and producing
results, creates a sense of direction. In addition, it can also reflect the parallel
structure of algorithms and is related to typed functional programming.

A bare graph of a system as described above represents the syntax of a
program. From the point of view of the geometric image, a system which manipulates
the syntax must be able to create/delete nodes; to create/delete arrows
and associate them with the nodes; to select nodes and arrows and to move
them to different positions in the medium; to create/delete intermediate points
(pseudonodes) for changing the direction or the shape of the arrows. In addition,
the following encapsulation property is not only desirable but also necessary for
the system to be able to give a global, bird's-eye view of the picture and make it
more observable:






Yuri Velinov 



the system must be able to encapsulate (shrink) the objects in an out- 
lined area into an arrow keeping intact the outside structure of the graph, 
and also be able to restore the encapsulated arrow back to the original 
form which it represents. 

In essence, this means the ability to shrink a subgraph to an arrow (compare
the paired panels of Fig. 1 and Fig. 2). The ability of the
system to meet such a property depends strongly on the intended semantics
and cannot be fulfilled completely. Intuitively, an outlined area in the case of
the automata interpretation may select something like a "subautomaton" (not in the
algebraic sense of this term) or, in the case of flow-diagrams, a subprogram.
Setting the semantics aside for a moment, even on a purely syntactical level
ordinary graphs can meet this requirement only in a very limited form: only an isolated
chain of arrows can be represented in compressed form as one arrow. Otherwise,
multi-head multi-tail arrows must be introduced (Fig. 2). This leads to the
conclusion that in general a richer mathematical structure should be taken
as an abstract basis instead of ordinary graphs. We will consider polynets
and polygraphs (generalizations of graphs where the arrows are multi-tail,
multi-head dragons) as an alternative possibility.




Fig. 2. 



A polynet

$\mathcal{N} = (N, A, \mathit{sor}, \mathit{tar})$

consists of a set of nodes $Nd(\mathcal{N}) = N$, a set of polyarrows $Ar(\mathcal{N}) = A$ and two
functions $\mathit{sor} : A \to \mathcal{P}(N)$ and $\mathit{tar} : A \to \mathcal{P}(N)$. The functions sor and tar
assign sets of nodes to each polyarrow. Thus a polyarrow can have
several source nodes and several target nodes.
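The following is a small sketch of a polynet under the definition above, with sor and tar returning sets of nodes. The names `Polynet`, `add_polyarrow` and `from_graph` are assumptions made for illustration only. The helper `from_graph` shows how an ordinary graph embeds as the special case in which every source and target set is a singleton.

```python
from dataclasses import dataclass, field


@dataclass
class Polynet:
    """Polynet (N, A, sor, tar): sor and tar map each polyarrow to a set of nodes."""
    nodes: set = field(default_factory=set)
    sor: dict = field(default_factory=dict)   # polyarrow name -> frozenset of sources
    tar: dict = field(default_factory=dict)   # polyarrow name -> frozenset of targets

    def add_polyarrow(self, name, sources, targets):
        self.sor[name] = frozenset(sources)
        self.tar[name] = frozenset(targets)
        self.nodes |= self.sor[name] | self.tar[name]


def from_graph(sor, tar):
    """Embed an ordinary graph: every arrow gets singleton source and target sets."""
    net = Polynet()
    for arrow in sor:
        net.add_polyarrow(arrow, {sor[arrow]}, {tar[arrow]})
    return net


net = Polynet()
net.add_polyarrow("f", {"p", "q"}, {"r", "s"})   # two tails and two heads
plain = from_graph({"a": 1}, {"a": 2})           # the graph 1 -a-> 2 as a polynet
```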

A structure which can be used as an intermediate step for the categorical
treatment of the semantics of polynets is the polygraph. Similarly to graphs, a
polygraph

$(O, A, +, \Lambda, \mathit{dom}, \mathit{cod})$

consists of node-objects $O$, arrows $A$ and two functions $\mathit{dom}, \mathit{cod} : A \to O$ which
assign one source object and one target object to each arrow. On
the other hand, the objects have structure: together with the operation $+ : O \times O \to O$
and the unit object $\Lambda$ they form a monoid. Additional requirements can be
imposed on the monoidal structure to be able to treat more specific cases. For
example, Meseguer and Montanari consider commutative monoids to describe
processes in Petri Nets. In the cases appropriate for our purpose the monoid of
objects is the free monoid generated by a set of nodes $N$. Thus an object is
a tuple of nodes. Under this restriction polygraphs resemble polynets in that
each arrow can have many nodes as sources and targets, but in a polygraph the
source nodes or the target nodes of an arrow are organized as tuples, and because
of this repetitions are possible. In essence, this means that different nodes can
be interpreted to range over the same domain. A finite polynet in which the
nodes are linearly ordered can be considered as a generator of a polygraph. The
monoid of the generated polygraph is the free monoid over the nodes. The arrows
stay the same but their sources and targets are determined as tuples, taking into
account the order imposed on the nodes. Though favorable from the point of view
of the semantics, polygraphs are not convenient for geometric representation,
if only because the free monoid, containing an infinite number of
elements, cannot be displayed. This is why it is better to consider them only as
an intermediate step to the semantic constructions.
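The passage from a linearly ordered polynet to the sources and targets of the generated polygraph can be sketched as follows. The node names, the chosen order and the helper `as_tuple` are all assumptions for illustration. Objects of the free monoid are modelled as Python tuples, with concatenation as the monoid operation and the empty tuple as the unit object.

```python
def as_tuple(node_set, order):
    """Arrange a set of nodes as a tuple that follows the given linear order."""
    return tuple(n for n in order if n in node_set)


order = ["p", "q", "r", "s"]            # assumed linear order on the nodes
sor = {"f": {"r", "p"}, "g": {"q"}}     # polynet sources, as sets
tar = {"f": {"s"}, "g": {"s", "r"}}     # polynet targets, as sets

# Domains and codomains of the same arrows in the generated polygraph.
dom = {a: as_tuple(s, order) for a, s in sor.items()}
cod = {a: as_tuple(t, order) for a, t in tar.items()}

UNIT = ()                               # unit object of the free monoid
combined = dom["f"] + dom["g"]          # monoid operation: tuple concatenation
```

Since tuples of every finite length are objects, the monoid is infinite even for a finite node set, which is the reason given above for not displaying polygraphs directly.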

The semantic constructs for ordinary graph-programs can again be given in 
the framework of Category Theory. 

A category is a graph with an additional binary partial operation on arrows
subject to specific axioms. Formally, a category is a tuple

$(O, A, \circ, \mathit{dom}, \mathit{cod}, I)$

where $(O, A, \mathit{dom}, \mathit{cod})$ is a graph with a set $A$ of arrows and a set $O$ of nodes
called objects, $\circ : A \times A \to A$ is a partial operation on arrows, and $I : O \to A$ is
a function which associates an arrow called the identity arrow with each object. The
components of a category fulfil the following axioms:

- The composite $x \circ y$ of any two arrows $x$ and $y$ exists iff $\mathit{dom}(x) = \mathit{cod}(y)$,
  and in this case $\mathit{dom}(x \circ y) = \mathit{dom}(y)$ and $\mathit{cod}(x \circ y) = \mathit{cod}(x)$;

- For any three arrows $x, y, z$, $(x \circ y) \circ z = x \circ (y \circ z)$ provided the described
  composites exist;

- For any arrow $x$, $x \circ I(\mathit{dom}(x)) = I(\mathit{cod}(x)) \circ x = x$.
More information about categories can be found in Pli) It is easy to recognize
the underlying graph structure of any category correlating the functions dom 
and cod with sor and tar respectively. A typical example of a category is the 
category SET which has sets as objects and total functions as arrows. We will use 
the category CFN of the computable functions, which has constructive domains 
as objects (domains generated from a finite initial set by several construction 
operations) and computable functions as arrows. 

Polycategories are similar to categories but have additional monoidal structure
on objects and an operation on arrows which includes objects as a necessary
specification. Formally, a Y-polycategory is a tuple

$(O, A, +, \Lambda, \mathit{dom}, \mathit{cod}, \lceil\,\rceil, I)$

where $(O, A, +, \Lambda, \mathit{dom}, \mathit{cod})$ is a polygraph, $I : O \to A$, and
$\lceil\,\rceil : A \times O \times O \times O \times A \to A$ is the arrow operation which, when defined,
assigns to each tuple $(x, a, b, c, y)$ an arrow denoted as $x\lceil a,b,c\rceil y$ (the triple of objects
in this notation specifies the way the two arrows $x$ and $y$ are composed). The
components of a polycategory fulfil several axioms as follows.

For any arrows $x, y, z$ and objects $a, b, c, d, e$:

- $x\lceil a,b,c\rceil y$ exists iff $\mathit{dom}(x) = a + b + c$ and $\mathit{cod}(y) = b$;

- if $x\lceil a,b,c\rceil y$ exists then $\mathit{cod}(x\lceil a,b,c\rceil y) = \mathit{cod}(x)$ and
  $\mathit{dom}(x\lceil a,b,c\rceil y) = a + \mathit{dom}(y) + c$;

- $(x\lceil a,b,c+d+e\rceil y)\lceil a+\mathit{dom}(y)+c,\,d,\,e\rceil z = (x\lceil a+b+c,\,d,\,e\rceil z)\lceil a,\,b,\,c+\mathit{dom}(z)+e\rceil y$
  provided the described composites exist (commutativity);

- $(x\lceil a,b,c\rceil y)\lceil a+d,\,\mathit{cod}(z),\,e+c\rceil z = x\lceil a,b,c\rceil (y\lceil d,\,\mathit{cod}(z),\,e\rceil z)$ provided the
  described composites exist (which requires that $\mathit{dom}(y) = d + \mathit{cod}(z) + e$) (associativity);

- $I(\mathit{cod}(x))\lceil \Lambda,\mathit{cod}(x),\Lambda\rceil x = x\lceil \Lambda,\mathit{dom}(x),\Lambda\rceil I(\mathit{dom}(x)) = x$.

The accompanying figure illustrates some of the axioms.
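The arrow operation of a polycategory can be sketched for functions on tuples as follows. This is an illustration under stated assumptions, not the paper's construction: objects are tuples of node names, each arrow carries a Python function from value tuples to value tuples, and the composite's domain is taken to be a + dom(y) + c, that is, dom(y) substituted for b inside a + b + c. The names `Arrow` and `compose` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Arrow:
    dom: Tuple[str, ...]            # source object: a tuple of node names
    cod: Tuple[str, ...]            # target object: a tuple of node names
    fn: Callable[[tuple], tuple]    # maps a tuple of values to a tuple of values


def compose(x, a, b, c, y):
    """Return the composite of x and y at position (a, b, c), or raise if undefined."""
    if x.dom != a + b + c or y.cod != b:
        raise ValueError("composite does not exist")
    i, j = len(a), len(a) + len(y.dom)

    def fn(args):
        # Feed the middle slice to y and splice its result into x's input.
        return x.fn(args[:i] + y.fn(args[i:j]) + args[j:])

    return Arrow(a + y.dom + c, x.cod, fn)


n = ("n",)
x = Arrow(n + n + n, n, lambda t: (t[0] + t[1] * t[2],))   # (u, v, w) -> u + v*w
y = Arrow(n + n, n, lambda t: (t[0] + t[1],))              # (p, q) -> p + q
z = compose(x, n, n, n, y)                                 # plug y into x's middle input
```

Here `z` takes four inputs: the first and last go straight to `x`, while the middle two are first combined by `y`.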

A typical example of a polycategory is the polycategory VFN which has 
sets as objects and functions as arrows and where the monoidal operation is a 
slightly modified (because of A) Cartesian product. We will use the polycategory 
VCF of computable functions which has constructive domains as objects and 
computable functions as arrows. More information about polycategories can be 
found in 00 . 

Structures of the type considered above can be related by morphisms. A
graph morphism $F : G_1 \to G_2$ between two graphs $G_1$ and $G_2$ is a pair of
functions

$(F_{Nd} : Nd(G_1) \to Nd(G_2),\; F_{Ar} : Ar(G_1) \to Ar(G_2))$

which preserve the sources and the targets of the arrows, i.e. for any arrow
$x \in Ar(G_1)$, $\mathit{sor}(F_{Ar}(x)) = F_{Nd}(\mathit{sor}(x))$ and $\mathit{tar}(F_{Ar}(x)) = F_{Nd}(\mathit{tar}(x))$. Category
morphisms are called functors. A functor is a graph morphism on the graph
parts of the related categories and in addition it preserves the arrow composition
operation and the identity arrows. The precise definition of a functor can
be found in The definition of a Y-polycategory functor is given in 00 
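The source-and-target preservation condition for a graph morphism can be checked mechanically. In the sketch below (function and variable names are illustrative assumptions), each graph is given as a pair of dictionaries (sor, tar), and the two components of the morphism are dictionaries on nodes and on arrows.

```python
def is_graph_morphism(g1, g2, f_nodes, f_arrows):
    """Check that (f_nodes, f_arrows) preserves sources and targets.

    g1 and g2 are (sor, tar) pairs of dicts mapping arrow -> node.
    """
    sor1, tar1 = g1
    sor2, tar2 = g2
    return all(
        sor2[f_arrows[x]] == f_nodes[sor1[x]]
        and tar2[f_arrows[x]] == f_nodes[tar1[x]]
        for x in sor1
    )


# g1 is the chain 1 -a-> 2 -b-> 3; g2 is a single node with a loop.
g1 = ({"a": 1, "b": 2}, {"a": 2, "b": 3})
g2 = ({"loop": "n"}, {"loop": "n"})
ok = is_graph_morphism(g1, g2, {1: "n", 2: "n", 3: "n"}, {"a": "loop", "b": "loop"})

# g3 is the single arrow u -e-> v; sending both a and b to e breaks the sources.
g3 = ({"e": "u"}, {"e": "v"})
bad = is_graph_morphism(g1, g3, {1: "u", 2: "v", 3: "v"}, {"a": "e", "b": "e"})
```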

If polynets are used as an abstract basis, the programming environment becomes
more flexible and is able to accommodate other possible interpretations.
For example, multi-input/output automata can be considered as building blocks
of more complex systems obtained by connecting together some of their input
and output lines (Fig. 3). In this case (as with flow-diagrams) the nodes
keep the input/output signals. Logical circuits or Neural Nets are specific cases
of such an interpretation. Systems of Petri-net type used for describing
computational processes can be a source of other interpretations. To underline
how rich the area of possible interpretations really is, recall that even the
understanding of automata can differ: they can be of different types, say
Mealy or Moore, and they can also be considered as transducers or as acceptors.
These may be only different flavors, but each requires its own interpretation and
semantic treatment.



2 Abstract Structure of the Graph-Based Programming 
Languages 

The analysis carried out above gives enough background to outline the struc- 
ture of some graph-based programming languages. A thorough exposition of the 
variety of feasible interpretations is impossible in a small paper. In the present 
section we will give a more thorough description of the syntactical and semantic 
sides of a computational environment for automata and flow-diagrams. From the 
alternative possibilities considered in the previous section we select the ordinary 
graphs and the polynets, intending the arrows to represent computations. 

Any finite graph can be considered as a graph-automaton scheme. In order
to describe the automata interpretations of such a scheme it is convenient, given
two sets of signals $X$ and $Y$, to denote by $(X \times Y)^{\star}$ the subset of $X^* \times Y^*$
consisting of all pairs of strings of equal length (notice that there is a natural
isomorphism between $(X \times Y)^{\star}$ and $(X \times Y)^*$ which makes it possible to consider
strings of pairs instead of pairs of strings whenever it is more convenient). Having
in mind a slightly generalized version of deterministic finite automata of Mealy
type, necessary to accommodate a limited form of encapsulation, we can give
the following definition.

Definition 1 A graph-automaton $(G, X, Y, \gamma)$ consists of a finite graph $G$,
two finite sets - $X$ of input signals and $Y$ of output signals, and a labeling
function $\gamma : Ar(G) \to (X \times Y)^{\star}$.

A graph-automaton is elementary if the codomain of $\gamma$ is restricted to $X \times Y$.

An elementary graph-automaton is total iff the number of the outgoing arrows
of each node labeled with different input signals is $Card(X)$.

An elementary graph-automaton is deterministic if (but not only if) all
arrows outgoing from a node are labeled with different input signals.
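The totality and determinism conditions for an elementary graph-automaton can be tested directly. The sketch below assumes a representation that is not in the paper: each arrow is a triple (source, target, (input, output)), and the function names are hypothetical.

```python
def is_deterministic(arrows):
    """True if no node has two outgoing arrows with the same input signal."""
    seen = set()
    for src, _tgt, (x, _y) in arrows:
        if (src, x) in seen:
            return False
        seen.add((src, x))
    return True


def is_total(nodes, arrows, inputs):
    """True if the outgoing arrows of every node cover every input signal."""
    covered = {n: set() for n in nodes}
    for src, _tgt, (x, _y) in arrows:
        covered[src].add(x)
    return all(covered[n] == set(inputs) for n in nodes)


NODES = {"p", "q"}
X = {"0", "1"}
ARROWS = [
    ("p", "q", ("0", "a")),
    ("p", "p", ("1", "b")),
    ("q", "p", ("0", "b")),
    ("q", "q", ("1", "a")),
]
partial = ARROWS[:-1]                              # q has no arrow for input "1"
clashing = ARROWS + [("p", "p", ("0", "b"))]       # p has two arrows for input "0"
```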



In this paper our interest is restricted to deterministic automata. However,
we refrain from presenting the cumbersome conditions for determinism
in the general case. Instead, we will consider only automata which originate
from deterministic elementary automata by encapsulating sequences of arrows
as described later.

With each graph-automaton $(G, X, Y, \gamma)$ the category of paths over $G$ is
associated and naturally interpreted. In this category every two arrows
$o_0 \xrightarrow{u/v} o_1$ and $o_1 \xrightarrow{u'/v'} o_2$ such that the target of the first coincides
with the source of the second can be composed to produce the arrow
$o_0 \xrightarrow{uu'/vv'} o_2$. Identifying the
arrows of the category with their labels, the semantics of a graph-automaton can
be determined denotationally as a family of functions $(f_o : X^* \to Y^*)$ associated
with the nodes of the graph. For each node $o$,

$f_o(x_1 x_2 \ldots x_n) =$ the second component of the arrow $x_1 x_2 \ldots x_n / y_1 y_2 \ldots y_n$
whose source is $o$

(this family is nothing more than a form of representation of the well-known
extended output function of an automaton).

The operational description of the semantics is obvious: trace the graph fol- 
lowing the arrows, according to their labels, in the order of the input signals and 
output the second signal of the label of each passed arrow. 
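The operational description can be sketched in a few lines for a deterministic elementary graph-automaton. The triple representation (source, target, (input, output)) and the name `run` are assumptions made for this illustration; the example automaton is hypothetical.

```python
def run(arrows, start, inputs):
    """Trace the graph from `start`, emitting the output signal of each arrow passed."""
    # For a deterministic automaton, (node, input) identifies at most one arrow.
    step = {(src, x): (tgt, y) for src, tgt, (x, y) in arrows}
    node, outputs = start, []
    for x in inputs:
        node, y = step[(node, x)]   # raises KeyError if the automaton is not total
        outputs.append(y)
    return outputs


ARROWS = [
    ("p", "q", ("0", "a")),
    ("p", "p", ("1", "b")),
    ("q", "p", ("0", "b")),
    ("q", "q", ("1", "a")),
]
result = run(ARROWS, "p", ["0", "0", "1"])
```

For a fixed start node o, the mapping from input strings to the list returned by `run` plays the role of the function f_o in the denotational description.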

A graph-automaton of the considered form permits encapsulation only of an
area with one incoming arrow and several outgoing arrows which comprises a tree
subgraph (the internal leaves, if any, are always considered as targets of outgoing
arrows). The encapsulation in such a case results in a bunch of arrows with the
same source and different targets. Each arrow is the composite of the original
arrows in a path which starts with the unique incoming arrow and finishes with
an arrow outgoing from the area.
A more general encapsulation property encounters difficulties. Parallelism
can be treated in the framework of partially additive categories, but
Category Theory does not have appropriate mechanisms to handle loops. An
additional unary operation (star operation) can be introduced for this purpose,
and this may lead to an enriched category structure capable of handling regular
expressions. Still, the situation will not be quite satisfactory since the
correspondence between regular expressions and automata is not one to one. The
introduction of net-automata where multi-tail-single-head (or single-tail-multi-head)
arrows are possible may also be interesting. In general, this direction is
not well developed and requires further study.

Aiming to interpret graphs as prescriptions for computational processes, we
may follow the ideas of Burstall and Goguen [3].

Definition 2 A graph-program scheme (G,i,e) is a finite graph G where 
two (not necessarily different) nodes i and e are selected as an initial node and 
a final node respectively and where every node is reachable from the initial node. 

A simple example of a graph-program scheme can be seen in the accompanying figure,
where the initial node is marked with o and the final node is marked with +.

An interpreted program scheme becomes a graph-program. To interpret a 
program scheme (G,i,e) we map it by a graph-morphism into the category 
of computable functions fulfilling the additional restriction that the alternative 
arrows outgoing from a node must have as images functions with disjoint domains 
of definition. Thus, formally: 




Definition 3 A graph-program is a tuple $P = (G, i, e, F)$ where $(G, i, e)$ is a
graph-program scheme and $F : G \to CFN$ is a graph morphism which assigns
constructive domains to the nodes and computable functions to
the arrows, subject to the restriction that the functions associated with alternative
arrows have disjoint domains of definition.

The accompanying figure shows a graph-program for computing n!.
The semantics of a graph-program P is a computable function. It can be
determined in an operational manner as follows:

• Associate with each node a variable ranging over its corresponding set (or a 
tuple of variables if the corresponding set is a product); 

• Assign initial values to the variables of the initial node; 

• In all possible cases consume the values assigned to the variables of a node 
by computing the function which has it as a source and assigning the value 
of the function to the variables of its target node. In any case of alternative 
arrows choose the one which represents function defined for the currently 
available value in its domain; 

• Continue the above process until the final node is reached. The obtained 
value from the set associated with the final node is the value of the function 
determined by the program. 

Notice that with this approach we may think that a function is sensing its 
domain for availability of data (compare this observation with the token game 
for net-programs described below). 
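The operational rules above can be sketched as a small interpreter. Everything specific here is an assumption for illustration: the three-node program for n! is hypothetical (it is not claimed to match the paper's figure), partial functions are modelled as a (defined, function) pair, and the names `run_program` and `FACTORIAL` are invented.

```python
def run_program(arrows, initial, final, value):
    """Follow arrows from `initial` until `final` is reached; return the final value.

    Each arrow is (source, target, defined, fn), where `defined` tells whether
    the partial function `fn` is defined for the value held at the source node.
    """
    node = initial
    while node != final:
        choices = [(tgt, fn) for src, tgt, defined, fn in arrows
                   if src == node and defined(value)]
        if len(choices) != 1:
            # Alternative arrows must have disjoint domains of definition.
            raise ValueError(f"no unique applicable arrow at node {node!r}")
        node, value = choices[0][0], choices[0][1](value)
    return value


# Node i holds n, node m holds a pair (n, r), node e holds the result.
FACTORIAL = [
    ("i", "m", lambda n: True, lambda n: (n, 1)),
    ("m", "m", lambda v: v[0] > 0, lambda v: (v[0] - 1, v[1] * v[0])),
    ("m", "e", lambda v: v[0] == 0, lambda v: v[1]),
]
```

The two arrows leaving node m are alternative: one is defined only for n > 0 and the other only for n = 0, so exactly one of them "senses" usable data at each step. For example, `run_program(FACTORIAL, "i", "e", 5)` evaluates the loop arrow five times before the exit arrow applies.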

The semantics of a graph-program $P = (G, i, e, F)$ can be given in a
denotational manner using the technique of Category Theory. The graph $G$ of
the program generates a free category $G^*$ - the category of paths in $G$ (the
arrow operation simply joins two connected paths together). The graph morphism
$F : G \to CFN$ can be extended in a unique way to a category functor
$F^* : G^* \to CFN$. It can be proved that the union of all functions corresponding
to the arrows in $G^*$ with source $i$ and target $e$ is a function. This function is the
one computed by $P$, i.e.

$$\text{the meaning of } P = \bigcup_{p \in \mathrm{Hom}_{G^*}(i,e)} F^*(p).$$

A comprehensive description of the above construction can be found in jS] . 
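As a worked illustration of the union over paths, consider a hypothetical three-node program for n! (an assumption for this example, not necessarily the program in the paper's figure) with arrows init from i to m, loop from m to m, and exit from m to e. Every path from i to e has the form of init, then k repetitions of loop, then exit:

```latex
\begin{align*}
F(\mathit{init})(n) &= (n, 1), \\
F(\mathit{loop})(n, r) &= (n - 1,\; r \cdot n) && \text{defined for } n > 0, \\
F(\mathit{exit})(n, r) &= r && \text{defined for } n = 0, \\
p_k &= \mathit{exit} \circ \mathit{loop}^{k} \circ \mathit{init}, && k \ge 0, \\
F^*(p_k)(n) &= n! && \text{defined only for } n = k, \\
\bigcup_{k \ge 0} F^*(p_k) &= (n \mapsto n!).
\end{align*}
```

Each path contributes a partial function defined on a single input, the domains for different k are disjoint, and so the union is itself a function, in line with the claim above.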

From the discussion above it is easy to see that any area with one incoming
and one outgoing arrow itself comprises a graph-program (a subprogram of the
original one). Therefore, any such area can be encapsulated to an arrow and
supplied with the corresponding meaning (see panels d and e of the accompanying
figure). Notice that the initial
and the final nodes are always considered as external to any area.

Polynets together with polygraphs used as a foundation can give more expres- 
sive power and flexibility to a graphical computational environment. In particular 
they inherently incorporate parallelism. 

Definition 4 A net-program scheme (N, S, T) is a polynet N with a distinguished
set of initial nodes S and a set of final nodes T such that:

- every node is reachable from at least one initial node;

- every node which is not a target of an arrow is an initial node and every
  node which is not a source of an arrow is a final node;

- for every two paths which have common target nodes, one is part of the other
  or they are alternative (in the sense that they have a common initial part,
  which might be empty, and then split through alternative arrows).

Fig. 5 shows a simple net-program scheme.




Fig. 5. 



To interpret a net-program scheme (N, S, T) it is necessary to accommodate
its arrows to the structure of the multidomain functions, which are usually
defined on products. For this purpose we first associate a polygraph N* with the
net-program scheme by considering the free monoid of its nodes and then by
associating objects of the free monoid with the arrows as their sources or targets
and also with the initial nodes and the final nodes. This must be done in such a
way that the domain or codomain of an arrow is a tuple of its source or target
nodes respectively. Then the initial object of N* is a tuple of the nodes in S
and the final object of N* is a tuple of the nodes in T. Such an association can
be done in different ways. Once N* is obtained, an interpretation of (N, S, T)
can be considered as a polygraph morphism F : N* → VCF from the polygraph
to the polycategory of computable functions, such that the functions
which correspond to alternative arrows have different domains of definition on
the subdomains of their common nodes. Thus formally:

Definition 5 A net-program is a tuple $(N^*, S^o, T^o, F)$ where $S^o, T^o$ are
objects of $N^*$ (the initial and the final objects respectively) and $F : N^* \to VCF$ is
a polygraph morphism to the polycategory of computable functions fulfilling the
"alternative arrows requirement".

Fig0shows a simple net-program for computing of a’^b’^. 

The operational semantics of a net-program can be given at this stage by
following some rules for the control of the computational process. These rules
can be described as a token game which, though similar in spirit, is quite different
from the token game used in Petri Net theory. We distinguish two types of
tokens: active and passive. A node can contain at most one token.

• At the beginning (step 0) all initial nodes contain active tokens; 

• At step t each arrow not alternative to another, and such that all its source 
nodes contain a token at least one of which is active, is activated. In alter- 
native situations just one of the possible alternative arrows which fulfil the 
above condition is activated. An activated arrow changes its source tokens 
to passive and produces active tokens in its target nodes. 

• The token flow stops when there is no possibility for it to be continued. 

A computation goes along with the above rules by computing at each step the 
values of the functions corresponding to the activated arrows for the available 
data. In case of alternative arrows the one which represents the function defined 
for the available data must be chosen. 
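The control part of the token game can be simulated as below. This sketch makes several assumptions that the text leaves open: two arrows are treated as alternative when they share a source node, the alternative chosen is simply the first in name order (the data-dependent choice is not modelled), and all names, including `token_game` and the example net, are hypothetical.

```python
def token_game(arrows, initial, max_steps=100):
    """Simulate token flow; `arrows` maps name -> (source set, target set)."""
    tokens = {n: "active" for n in initial}        # node -> "active" or "passive"
    trace = []
    for _ in range(max_steps):
        fired, used = [], set()
        for name in sorted(arrows):
            src, _tgt = arrows[name]
            enabled = (all(n in tokens for n in src)
                       and any(tokens[n] == "active" for n in src))
            # Among arrows sharing a source node, only the first one fires.
            if enabled and not (src & used):
                fired.append(name)
                used |= src
        if not fired:
            break                                   # the token flow stops
        for name in fired:
            for n in arrows[name][0]:
                tokens[n] = "passive"               # consumed source tokens
        for name in fired:
            for n in arrows[name][1]:
                tokens[n] = "active"                # produced target tokens
        trace.append(fired)
    return trace, tokens


NET = {
    "f": (frozenset({"s1"}), frozenset({"u"})),
    "g": (frozenset({"s2"}), frozenset({"v"})),
    "h": (frozenset({"u", "v"}), frozenset({"t"})),
}
trace, tokens = token_game(NET, {"s1", "s2"})
```

In this example the independent arrows f and g fire together in the first step, which is the inherent parallelism mentioned earlier, and h fires once both of its sources hold tokens.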

The denotational semantics of a net-program can be given in a categorical
manner if the underlying net fulfills additional restrictions necessary to
accommodate it to the expressive abilities of the arrow composition operation in a
polycategory. The transition from $N$ to $N^*$ must completely preserve all aspects of
the connectivity of the arrows of $N$. In essence, it must be such that

• the common sources (targets) of any two arrows form a subtuple of their
domains or codomains respectively,

• the common sources (targets) of any arrow with the initial or final object
form its subtuple respectively.

For example, these requirements can be met if $N$ is planar. Under this
condition the semantics of a net-program $P = (N^*, S^o, T^o, F)$ can be given in a way
similar to the one considered for graph programs:

$$\text{the meaning of } P = \bigcup_{p \in \mathrm{Hom}_{(N^*)^*}(S^o, T^o)} F^*(p)$$

where $p$ ranges over the arrows in the polycategory $(N^*)^*$ freely generated by $N^*$ (see [Zj for
details) and $S^o$, $T^o$ are the initial and the final object. A more thorough description
of this construction can be found in jSj .

The encapsulation resources of net-programs are much better than those
of the other considered cases. The definitions of a net-program scheme and a
net-program show immediately that any area with one outgoing arrow (initial and
final nodes are always considered as external to any area) comprises a net-program
and therefore can be encapsulated to an arrow associated with appropriate
semantics (Fig. 5d).

3 Remarks on the Implementation 

The discussion in the previous sections gives enough formal background for prac- 
tical design of a graph-based computational environment. 



A Graph-Based Computational Environment 211 



[Figure: a) a geometric polyarrow; b) nodes structure, with fields Tag and Type; c) arrows structure, with fields BegCoord, EndCoord, SourcePtr, Name, Next, TargetPtr and SemanticPtr]

Fig. 6. 



Since graphs can be considered as a special case of polynets, for universality
we prefer the latter as a syntactical basis. A data structure for the internal
implementation of the environment on a conventional computer must embrace
simultaneously the parameters of the geometric picture (which are not important
for the computational process) and the graph structure. The standard
implementations of graphs are not appropriate for our purpose. Suppose that in its
most coarse form an arrow (Fig. 6a) is pictured as several polylines attached to a
geometric shape used as a core. A simple shape of the core can be a line segment
(the thick line in the picture). Then a possible data structure which can
incorporate the desired information is represented in Fig. 6b,c. It consists of a linked
list of nodes and a linked list of arrows. Each node, in addition to its name and
coordinates, carries semantic information in the form of a tag-type linked list (we
allow a node to represent a tuple of domains). Each arrow contains information
about its sources and targets, recognized by their coordinates. It also contains
information about the parameters of its geometric core and the coordinates of the
intermediate points for the outgoing or incoming geometric arrows in the form
of properly organized linked lists. A semantic pointer is supposed to associate
proper computations with an arrow. The fields which are not necessary under
some of the interpretations (like the semantic fields of the nodes in the Mealy
automata case) are ignored.
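A rough rendering of this data structure is sketched below. It is an assumption-laden illustration: Python lists stand in for the linked lists of the description, the `Attachment` class and the helper `source_nodes` are invented, and only the field names BegCoord, EndCoord, SourcePtr, TargetPtr, Name and SemanticPtr are taken from the figure.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple

Coord = Tuple[float, float]


@dataclass
class Node:
    name: str
    coord: Coord
    # Tag-type list: one (tag, type) pair per domain the node represents.
    semantics: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Attachment:
    """One polyline joining the core of an arrow to a source or target node."""
    node_coord: Coord                                       # node found by its coordinates
    waypoints: List[Coord] = field(default_factory=list)    # intermediate points


@dataclass
class Arrow:
    name: str
    beg_coord: Coord                                        # BegCoord of the core segment
    end_coord: Coord                                        # EndCoord of the core segment
    sources: List[Attachment] = field(default_factory=list)   # reached via SourcePtr
    targets: List[Attachment] = field(default_factory=list)   # reached via TargetPtr
    semantic: Optional[Callable[..., Any]] = None              # SemanticPtr


def source_nodes(arrow, nodes):
    """Resolve the sources of an arrow to node names through their coordinates."""
    by_coord = {n.coord: n for n in nodes}
    return [by_coord[a.node_coord].name for a in arrow.sources]


nodes = [Node("p", (0.0, 0.0), [("x", "int")]), Node("q", (0.0, 2.0)), Node("r", (4.0, 1.0))]
add = Arrow(
    "add", (1.5, 1.0), (2.5, 1.0),
    sources=[Attachment((0.0, 0.0)), Attachment((0.0, 2.0), [(1.0, 2.0)])],
    targets=[Attachment((4.0, 1.0))],
    semantic=lambda a, b: a + b,
)
```

Keeping the geometry (coordinates and waypoints) next to the graph structure in one record is what lets a single structure serve both the display and the computation; an interpretation that does not need a field, such as `semantics` on nodes for Mealy automata, simply leaves it empty.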

4 Conclusions 

In the previous sections we have considered some possibilities for designing
a graph-based computational environment. It can be seen that polynets are the
necessary formal basis if a more complete encapsulation property is desired. The
creation of a real graph-based programming language is favored by the
existence of solid theoretical foundations and technological premises.
Nevertheless, one can hardly expect that a graphical programming language,
the way it is considered here, can be a substitute for an ordinary programming
language. If the arrows represent primitive operations, the graph of a serious
program will be too big to be comfortably observed. In addition, a proper
mechanism for recursive definitions is still missing. On the other hand, the graphical
approach can give a better description (from the human point of view) of the
global structure of a program. Therefore, it is more realistic to consider and
design graph-based programming languages as extensions to conventional
programming languages. The arrows can represent complex computations
described in a conventional manner, and a graph program can generate an ordinary
program which can be processed further by an ordinary compiler. Moreover,
graph-based computational languages can be very convenient and fruitful
in areas where recursion mechanisms are not necessary.
Abstract. In this article we discuss the design patterns used
and proposed for the realization of finite state automata. Various aspects
in the design of a framework for the implementation of FSA will
be treated, presenting not only the patterns for the single components,
but the entire system design. Using design patterns to sketch a framework
means performing an "abstract implementation", from which it is
possible to realize concrete specific automata simply by customizing some
classes. In order to test the framework, some concrete lexical tools have
been created. The resulting automata and transducers are used to perform
word form analysis, word form generation, creation and derivation
history, spellchecking and phrase recognition.



1 Introduction 

Finite-state machines and transducers are used in many domains. They make it
possible to perform many operations in an efficient and elegant way. Recently,
a number of general-purpose class libraries have been developed at universities
and companies. The purpose of this article is not to present a new toolkit
providing implementations of all the known algorithms to optimize the
construction and the use of automata. Our purpose is instead to propose and
discuss a general design to be used when implementing finite-state automata
and extensions of them, such as transducers. From this general object-oriented
design, which does not depend on any particular programming language, we have
realized a general-purpose framework for the implementation of finite-state
systems, written in C++. Customizing the framework has led to the realization
of different lexical automata. In a certain way, the realization of the
object-oriented framework has served as a test for the overall design that we
present in this article, whereas the individual concrete lexical automata have
served as a test for the framework's customization.

First of all, we must point out what a framework is. According to [3], a
framework is more than a simple toolkit. It is a set of collaborating classes
that make up a reusable design for a specific class of software. The purpose
of the framework is to define the overall structure of an application, its
partitioning into classes and objects, their collaboration, and, most
important, the thread of control. These predefined design parameters allow
the programmer to concentrate on the specifics of the application. The
programmer customizes the framework for a particular application by creating
application-specific subclasses of (possibly abstract) classes from the
framework. The framework itself can be viewed as an abstract finite-state
element. Only the definition of some concrete classes can generate a usable
automaton from it. The main design decisions have therefore already been
taken, and the applications (finite-state elements) are faster to implement
and even easier to maintain. Even though the framework has been fully
implemented and its customization has allowed the realization of many
different lexical tools, also available as client/server through a Web page,
the aim of this article is to discuss it from a software design point of
view, in order to allow reuse at a higher level.

There are five main components of the framework which have to be customized.
For each of those five parts we have used an existing design pattern. We also
suggest a pattern for a sixth part (external input), which, however, we have
not implemented as customizable. For the input/output component we have used
the Adapter, for the FSA nodes the Template, for the traversing algorithm
the Strategy, for the action to be performed at each node a simplified
version of Visitor and, finally, to locally manage the instantiation of each
single customizable element, the Factory design pattern.



2 Input/Output: Adapter 

The internal structure of a concrete FSA can be printed in different forms. We
use a readable text form when we want to be able to edit it, in order to
understand the internal elements, data types and links of the realized
automaton and, possibly, to debug it. We use an encrypted form when we want
to release our tools for public use. These two kinds of output (and
corresponding input) treat the data in different manners and have different
interfaces. What we need is a common adapted interface which allows the
replacement of the concrete input/output device (object) at any time and
without consequences for the rest of the program. We need a wrapper. As
sketched in Fig. 1, the Adapter design pattern helps to realize a wrapper
that converts the interface of a class into another interface that clients
expect. The framework must provide a mechanism that allows the adaptation of
the output without breaking the interface used.
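
A minimal sketch of such a wrapper is given here in Python for brevity (the framework itself is written in C++). All names (FsaStore, TextStore, ByteDevice, EncodedStore, save) are hypothetical, the XOR step merely stands in for whatever encoding a real tool would apply, and the bounded buffer is only an illustration of buffering data before it is transformed.

```python
from abc import ABC, abstractmethod
from collections import deque


class FsaStore(ABC):
    """Interface the framework expects from any input/output device."""

    @abstractmethod
    def write(self, text): ...

    @abstractmethod
    def read(self): ...


class TextStore(FsaStore):
    """Readable form: data is stored exactly as given."""

    def __init__(self):
        self.data = ""

    def write(self, text):
        self.data += text

    def read(self):
        return self.data


class ByteDevice:
    """Hypothetical existing device with a different interface (raw bytes)."""

    def __init__(self):
        self.raw = bytearray()

    def put(self, chunk):
        self.raw.extend(chunk)

    def get_all(self):
        return bytes(self.raw)


class EncodedStore(FsaStore):
    """Adapter: exposes a ByteDevice through the FsaStore interface.

    A buffer collects data and flushes it in encoded blocks.
    The XOR step is only a placeholder for a real encoding.
    """

    def __init__(self, device, key=0x5A, block=4):
        self.device, self.key, self.block = device, key, block
        self.buffer = deque()

    def _flush(self, force=False):
        while len(self.buffer) >= self.block or (force and self.buffer):
            n = min(self.block, len(self.buffer))
            chunk = bytes(self.buffer.popleft() ^ self.key for _ in range(n))
            self.device.put(chunk)

    def write(self, text):
        self.buffer.extend(text.encode("utf-8"))
        self._flush()

    def read(self):
        self._flush(force=True)
        decoded = bytes(b ^ self.key for b in self.device.get_all())
        return decoded.decode("utf-8")


def save(store, lines):
    """Client code: depends only on the FsaStore interface."""
    for line in lines:
        store.write(line + "\n")
```

The client function save() works unchanged with either store, which is the property the text asks for: the concrete device can be replaced without consequences for the rest of the program.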

The new input/output will be wrapped into a new class, which not only
forwards each call to the corresponding method of the wrapped object, but also
executes some adapting operations on the data. The tasks to be performed in our
encrypted input/output class are very different from the ones used in the
textual output. The adapting operations of the former have required the
implementation of a circular buffer used to store a certain amount of
temporary data, waiting, in the case of output, to be transformed into the
desired encrypted data, and, in the case of input, to be loaded into the
internal structure.

Fig. 1. Adapter interface



3 Node: Template 

Each finite-state tool can have a different kind of node, depending on the kind
of information it must encode and on its use, unidirectional or bidirectional.
The possibility of defining a new node therefore represents a level of
customization. The new kind of node should take advantage of the existing
managing algorithms, using them as they are, without further modifications.
There are two methods for realizing such a design: parameterized types and
common classes. Parameterized types, even though in general more efficient,
are a concept which is not available in every programming language, and this
would restrict the generality of our software design, which is intended to be
independent of any programming language. Because of this, and because we have
judged the second method more flexible, we have chosen the second one. Notice
that the method used is called Template in the design patterns terminology.
This is not to be confused with the C++ template mechanism, which is the C++
way of realizing parameterized types. The abstract Node class (Fig. 2) must
define the interface, providing all basic functionalities required for the
nodes by the internal algorithms. The latter will use the concrete elements
through Node references.
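
The idea can be illustrated with a small Python sketch (the framework itself is in C++). The class and method names below (add_arc, successors, accepts, AcceptorNode, TransducerNode, build_chain) are assumptions made for this illustration and are not taken from the framework. The abstract Node fixes the interface and one algorithm built on it, and the concrete nodes only fill in the primitive steps.

```python
from abc import ABC, abstractmethod


class Node(ABC):
    """Abstract node: the only type the framework algorithms refer to."""

    def __init__(self, final=False):
        self.final = final

    @abstractmethod
    def add_arc(self, symbol, target): ...

    @abstractmethod
    def successors(self, symbol):
        """Return the nodes reachable from this node by `symbol`."""

    def accepts(self, word):
        """Template method: fixed algorithm built on the abstract steps."""
        current = [self]
        for symbol in word:
            current = [t for n in current for t in n.successors(symbol)]
        return any(n.final for n in current)


class AcceptorNode(Node):
    """Concrete node for a plain automaton: arcs indexed by symbol."""

    def __init__(self, final=False):
        super().__init__(final)
        self.arcs = {}

    def add_arc(self, symbol, target):
        self.arcs.setdefault(symbol, []).append(target)

    def successors(self, symbol):
        return self.arcs.get(symbol, [])


class TransducerNode(Node):
    """Concrete node for a transducer: arcs carry an input and an output."""

    def __init__(self, final=False):
        super().__init__(final)
        self.arcs = []  # triples (input symbol, output string, target)

    def add_arc(self, symbol, target, output=""):
        self.arcs.append((symbol, output, target))

    def successors(self, symbol):
        return [t for s, _, t in self.arcs if s == symbol]


def build_chain(node_class, word):
    """Framework-side helper: builds a path using only the Node interface."""
    start = node = node_class()
    for symbol in word:
        nxt = node_class()
        node.add_arc(symbol, nxt)
        node = nxt
    node.final = True
    return start
```

Both build_chain() and Node.accepts() work with any node class, so a new kind of node reuses the existing algorithms as they are.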




Fig. 2. Node hierarchy



4 Traversing the FSA: Strategy

A transducer used to generate word forms is not deterministic and needs to be
traversed in a nondeterministic way, whereas an automaton which is known a
priori to be deterministic can use a simpler traversing algorithm, which does
not look for all search paths. Looping over nodes is also an option that could
be switched off, for some tools, for efficiency purposes. As we can see, there
are many algorithms to consider for traversing finite-state elements.
Hard-wiring all of them into the class that may require them is not desirable:
first, because the class will get more complex if it has to include all
possible algorithms, and different algorithms will be appropriate at different
times; second, because it will become difficult to add new algorithms or to
vary existing ones when traversal is an integral part of the class that uses
it. We can avoid these problems by encapsulating the different traversing
algorithms in different classes, using the same interface, as shown in Fig. 3.
The interface is defined by a common superclass, the Strategy class. The
intent of the Strategy pattern is to define a family of algorithms,
encapsulate each one, and make them interchangeable.




Fig. 3. Strategy 
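
As an illustration only, the following Python sketch shows two interchangeable traversal classes behind one interface. The representation of the automaton as a dictionary from (state, symbol) to a list of targets, and all class names other than Strategy, are assumptions of this sketch rather than details of the framework.

```python
from abc import ABC, abstractmethod


class Strategy(ABC):
    """Common interface of all traversal algorithms."""

    @abstractmethod
    def accepts(self, arcs, start, finals, word): ...


class DeterministicTraversal(Strategy):
    """Follows a single path; valid only if each (state, symbol) has one arc."""

    def accepts(self, arcs, start, finals, word):
        state = start
        for symbol in word:
            targets = arcs.get((state, symbol), [])
            if not targets:
                return False
            state = targets[0]
        return state in finals


class NondeterministicTraversal(Strategy):
    """Explores every search path by tracking the set of reachable states."""

    def accepts(self, arcs, start, finals, word):
        states = {start}
        for symbol in word:
            states = {t for s in states for t in arcs.get((s, symbol), [])}
        return bool(states & finals)


class Fsa:
    """Context: delegates traversal to an interchangeable Strategy object."""

    def __init__(self, arcs, start, finals, strategy):
        self.arcs, self.start, self.finals = arcs, start, set(finals)
        self.strategy = strategy

    def accepts(self, word):
        return self.strategy.accepts(self.arcs, self.start, self.finals, word)
```

On a deterministic automaton both strategies give the same answer. On a nondeterministic one the cheaper strategy may miss a path, which is why the choice has to be made per tool.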



5 Get Information: Visitor 

The main feature here is the separation of the information extraction
performed during the traversal process from the traversal algorithm itself. We
must keep the responsibility for the action away from the traversal part. In
this way we can use the same finite-state tool to deliver different types of
information. We need a callback mechanism able to store intermediate results.
We had two models to describe our design and implementation: Command and
Visitor. We used the Visitor because it seemed nearer to our purposes, even
though we used a simplified version of it, because the Visitor is intended
for use with different kinds of nodes at the same time, which is not our
case. But the meaning is the same: an external object is used to access and
read the data of the internal structure. The information extraction process
is embedded into the abstract class Visitor. Subclassing it means reusing the
nodes of the finite-state system to build a new kind of answer. During the
retrieval process the internal data in the nodes remains read-only, i.e.
unmodified. The adaptation is in the way the data will be used for the
external result. For example, the difference between the information
extracted from a lemmatizer and the information extracted from a
morphosyntactic analyzer can be coded simply by distinguishing two different
interpretations of the same data, i.e. by modifying the action performed
during the traversal. The traversal process is responsible for leading the
control through the structure, whereas the action, which is called for each
node, accumulates information during the traversal. This is particularly
useful with lexical transducers, which store input and output information in
the nodes. Separating the retrieval process from the internal structure
brings more flexibility and potential for reuse, because different kinds of
retrieval often require the same kind of traversal. In addition, we simplify
the task of customizing the retrieval by restricting the modification to the
action.

The implementation is organized as follows: an object of the class Strategy
is responsible for traversing the internal structure and has a reference to a
Visitor object. During the traversal, each node of the structure receives the
visit of the instantiated Visitor object. The instance is used for
accumulating information, creating the final result of the analysis. The
overall pattern is shown in Fig. 4. The abstract class Visitor is shown with
two (among many possible) inheriting concrete classes.




Fig. 4. Visitor 



Any FSA-specific internal data structure remains separated and hidden from
the visitor object, simplifying the task of the user of the framework.
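
The separation between traversal and action might look as follows in a Python sketch. The node layout, the "+" convention for tags and the two visitor classes are invented for the example, and the sketch replays each accepted path through a fresh visitor, which is a simplification of the per-node callback described above.

```python
from abc import ABC, abstractmethod


class Node:
    """Transducer node storing an input symbol and the attached output."""

    def __init__(self, symbol="", output="", final=False):
        self.symbol, self.output, self.final = symbol, output, final
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child


class Visitor(ABC):
    """Action applied at each node; accumulates the result of one analysis."""

    @abstractmethod
    def visit(self, node): ...

    @abstractmethod
    def result(self): ...


class LemmaVisitor(Visitor):
    """Keeps only the lexical part of the output (tags start with '+')."""

    def __init__(self):
        self.parts = []

    def visit(self, node):
        if not node.output.startswith("+"):
            self.parts.append(node.output)

    def result(self):
        return "".join(self.parts)


class MorphVisitor(Visitor):
    """Keeps the whole output, including the morphosyntactic tags."""

    def __init__(self):
        self.parts = []

    def visit(self, node):
        self.parts.append(node.output)

    def result(self):
        return "".join(self.parts)


def traverse(root, word, make_visitor):
    """Traversal side: finds accepting paths and calls visit() on each node."""
    results = []

    def walk(node, pos, path):
        if pos == len(word):
            if node.final:
                visitor = make_visitor()
                for n in path:
                    visitor.visit(n)  # nodes are only read, never modified
                results.append(visitor.result())
            return
        for child in node.children:
            if child.symbol == word[pos]:
                walk(child, pos + 1, path + [child])

    walk(root, 0, [])
    return results
```

The same nodes and the same traversal yield a lemma with one visitor and a full analysis with the other; only the action changes.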

6 Custom Instantiation: Factory 

Working with abstract classes to allow customization causes a typical problem
during the instantiation of the concrete classes. The framework uses the
abstract classes to define and maintain relationships between objects and is
also responsible for creating them. But the corresponding concrete classes are
defined later, because they are application-specific. The framework must
instantiate classes, but it only knows their abstract definition, and cannot
predict which concrete subclasses (for example, of Node and Visitor) to
instantiate. The solution to this dilemma is offered by the virtual
constructor, also classified as Factory method in design patterns. This
method takes advantage of the fact that the framework does not know the
concrete classes to instantiate, but knows exactly when their creators should
be applied. It moves the responsibility for instantiating concrete classes
out of the framework. The main framework class (in our case the Fsa class)
defines the four pure virtual operations createNode(), which will return a
Node reference, createVisitor(), which will return a Visitor reference,
createStrategy(), which will return a Strategy reference, and
createFileWrapper(), which will deliver the right input/output Adapter
object. Those operations are redefined by subclassing Fsa. The real
implementation of the methods is then part of the subclass of Fsa, which
instantiates the concrete classes (Fig. 5).




Fig. 5. Factory method 



The factory method is good for our system, because it eliminates the need to
bind domain-specific classes into the overall framework code, thereby
enhancing the customization and the reusability of the framework itself.
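
The four creation operations can be mirrored in a short Python sketch, where abstract methods play the role of the C++ pure virtual operations. The concrete classes (LexNode, LemmaVisitor, DepthFirst, TextFile, Lemmatizer) and the helper methods build() and parts() are hypothetical and serve only to show who decides what.

```python
from abc import ABC, abstractmethod


class Fsa(ABC):
    """Framework class: decides when parts are created, not which classes."""

    @abstractmethod
    def create_node(self): ...

    @abstractmethod
    def create_visitor(self): ...

    @abstractmethod
    def create_strategy(self): ...

    @abstractmethod
    def create_file_wrapper(self): ...

    def build(self, size):
        """Framework code: allocates nodes without naming their class."""
        return [self.create_node() for _ in range(size)]

    def parts(self):
        """Framework code: obtains the other customizable components."""
        return (self.create_visitor(), self.create_strategy(),
                self.create_file_wrapper())


# Application-specific classes (hypothetical), defined after the framework.
class LexNode: ...
class LemmaVisitor: ...
class DepthFirst: ...
class TextFile: ...


class Lemmatizer(Fsa):
    """Concrete tool: the subclass chooses the concrete classes."""

    def create_node(self):
        return LexNode()

    def create_visitor(self):
        return LemmaVisitor()

    def create_strategy(self):
        return DepthFirst()

    def create_file_wrapper(self):
        return TextFile()
```

The framework code in Fsa never mentions LexNode or the other concrete classes, and Fsa itself cannot be instantiated until a subclass supplies all four creators.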

7 External Input: Decorator 

There is another kind of input, the external input, i.e. the collection of
data coming from outside, which must be converted into the internal
finite-state structure. Unlike some existing finite-state tools, which need a
regular grammar description as input, our framework is based on an extended
input, from which it extracts the regular-expression mechanism to build the
automata. We consider as input each single sequence of symbols which has to be
recognized and retrieved in the FSA being built. Our external input module is
at the moment only partly customizable. The system itself is responsible for
transforming and optimizing the whole input into a minimal FSA. We know, on
the other hand, that other finite-state tools are built having as input other
kinds of sequences, regular expressions, continuation classes, etc. It
therefore seems useful to define a filter module for each kind of input, in
order to convert it to the type requested by the framework. The conversion
could be considered an operation to be applied outside the framework (in which
case it would not be part of the overall design) or inside it. The easiest way
of realizing it inside could be through the use of the Decorator pattern,
which would guarantee the complete reuse of the main existing input function
and, at the same time, the transparency of the operation.
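
Since this part was not implemented as customizable, the following Python sketch is purely speculative. It shows how input filters could be stacked as decorators around an unchanged main input function. The 'stem|suffix,suffix' notation and all class names are invented for the illustration.

```python
from abc import ABC, abstractmethod


class InputSource(ABC):
    @abstractmethod
    def sequences(self):
        """Yield the symbol sequences to be stored in the FSA."""


class LineInput(InputSource):
    """Main input function: one sequence per entry, passed through as is."""

    def __init__(self, entries):
        self.entries = list(entries)

    def sequences(self):
        yield from self.entries


class InputFilter(InputSource):
    """Decorator base: same interface, wraps another InputSource."""

    def __init__(self, inner):
        self.inner = inner

    def sequences(self):
        for entry in self.inner.sequences():
            yield from self.convert(entry)

    def convert(self, entry):
        yield entry


class SuffixClassFilter(InputFilter):
    """Expands 'stem|s1,s2' (hypothetical notation) into stem+s1, stem+s2."""

    def convert(self, entry):
        stem, sep, suffixes = entry.partition("|")
        if not sep:
            yield entry
        else:
            for suffix in suffixes.split(","):
                yield stem + suffix


class LowercaseFilter(InputFilter):
    """Normalizes every sequence to lower case."""

    def convert(self, entry):
        yield entry.lower()


def load(source):
    """Framework side: sees only the InputSource interface."""
    return sorted(set(source.sequences()))
```

Because every filter offers the same interface as the source it wraps, load() cannot tell whether it reads raw or converted input, which is the transparency the text refers to.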

8 Conclusions 

We have shown the software design of a framework representing an abstract
finite-state element, which can be easily customized to produce new kinds of
concrete finite-state tools. A real framework has been realized (using the C++
language) applying this design, and from this framework various concrete
finite-state elements have been implemented. The functionalities of the single
elements have been tested and demonstrated, and can be shown through a
client/server demo version on the Web at the following address:
http://www.wordmanager.com. But the significance of these tools does not
reside primarily in their individual functionalities. Their principal interest
lies in the fact that they can be produced with so little effort on the basis
of an existing object-oriented customizable framework.

Abstract. This paper is about the recognition of finite and infinite words
with at least two factorizations on a finite language. We implement the
construction of the graph of delays for a finite language L, which can be
viewed as an automaton that accepts the words and ω-words that are
ambiguously covered by L.



1 Introduction 



The goal of this software is to recognize words which have at least two
factorizations on a finite language L, for the power-* and the power-ω (see
Section 2) of this language. A word $\alpha$ is ambiguously covered by a
language L if there are at least two factorizations of $\alpha$ with different
first factors in L [Kar85]. To recognize those words, we use a graph inspired
by J. Karhumäki in [Kar85], called the graph of delays. He used this graph to
visualize the two factorizations of an infinite word in the case of a language
generated by a code with three elements.

We use graphs of delays to classify different samples of finite generators of
ω-languages by the number and the representation of factorizations of words
ambiguously covered by the given language. The aim is to decide whether, for
a given finite language L, there exists a code (or an ω-code) C such that
$L^\omega = C^\omega$. We already know that for a code C such that $C^+$ is
the greatest generator of $C^\omega$, there is no ω-code as ω-generator, and
the graph of delays allows us to decide whether a finite language is a code
and an ω-code or not.

None of the existing software tools allows us to do this construction, so we
have written a program in Java which implements the graph of delays, draws it
and uses it as an automaton to recognize words with at least two
factorizations on a finite language.

Our program allows us to construct this graph of delays for a finite language
and to use it as a nondeterministic automaton to recognize the set of finite
and infinite words ambiguously covered by the language L (and then to decide
whether L is a code or an ω-code). This graph of delays is very close in
spirit to the construction of the test of Sardinas and Patterson to decide
whether a language is a code.



2 Preliminaries

Let $\Sigma$ be a finite alphabet. The set $\Sigma^*$ is the set of all finite
words over $\Sigma$ and $\Sigma^\omega$ is the set of all infinite words. The
empty word is denoted by $\varepsilon$ and
$\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$. Let $L \subseteq \Sigma^*$ be
a finite language; then $L^*$ is the set of finite words obtained by finite
concatenation of words of L, and $L^\omega$ is the set of infinite words (also
called ω-words) obtained by infinite concatenation.

A factorization over L of a word $u \in L^*$ is a sequence
$(u_1, u_2, \ldots, u_k)$ of words of L such that $u = u_1 u_2 \cdots u_k$. A
factorization over L of a word $\alpha \in L^\omega$ is an infinite sequence
$(\alpha_1, \alpha_2, \ldots, \alpha_k, \ldots)$ of words of L such that
$\alpha = \alpha_1 \alpha_2 \cdots \alpha_k \cdots$.
We say that a word u has two factorizations over a language L (or u is
ambiguously covered by L) if u has two factorizations with different first
words in L. We denote by pref(L) (resp. pref(u)) the set of prefixes of words
of L (resp. the set of prefixes of the word $u \in L$).

A language C is a code iff each finite word of $C^*$ has only one
factorization over C, and we say that it is an ω-code iff each infinite word
of $C^\omega$ has only one factorization over C.

Throughout the paper, we consider words with two factorizations that have no
common factors. For example, with the language $L = \{a, abac, ba, ca\}$, the
word baabacacaca has two factorizations (see Fig. 1), but the first factor and
the last two are the same in both. In such a word, only the factorization of
abaca is really ambiguous.
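
These definitions can be checked on small examples with a brute-force enumeration. The following Python sketch (the function names are ours, and the exhaustive search is only meant for short words) lists all factorizations of a finite word and tests whether two of them start with different factors.

```python
def factorizations(word, language):
    """Return every way to write `word` as a concatenation of words of `language`."""
    if word == "":
        return [[]]
    result = []
    for factor in sorted(language):
        if factor and word.startswith(factor):
            for rest in factorizations(word[len(factor):], language):
                result.append([factor] + rest)
    return result


def ambiguously_covered(word, language):
    """True if two factorizations of `word` start with different factors."""
    first_factors = {f[0] for f in factorizations(word, language) if f}
    return len(first_factors) >= 2
```

With L = {a, abac, ba, ca}, the word abaca has the two factorizations (a, ba, ca) and (abac, a). The word baabacacaca also has two factorizations, but both begin with ba, so it is not ambiguously covered in the sense used here.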




Fig. 1.



3 The Graph of Delays

Given a finite language, we define the graph of delays as a directed graph
whose set of vertices is the set of delays between two factorizations, i.e.
the labels of the vertices are the advances of one factorization over the
other for a prefix of a given ambiguous word. Intuitively, a labeled edge
corresponds to the closest prefix of one of the two factorizations after the
prefix read so far. The delay $\varepsilon$ means either that we are at the
beginning of a word or that we have two factorizations of a finite word. Thus
we can decide whether a language L is a code and not an ω-code, or whether
this language is an ω-code.

Definition 1 Let L be a language. We call graph of delays the pair (V, E) such
that:

- V is the set of vertices
  $V = (L^+)^{-1} L^+ \cap \mathrm{pref}(L)$

- E is the set of oriented edges
  $E = \{(u, l, v) \in V \times \Sigma^* \times V \mid (uv \in L \text{ and } l = v) \text{ or } (l \in L,\ ul = v \in \mathrm{pref}(L)) \text{ or } (v = \varepsilon,\ l \in L,\ ul \in L)\}$

(u, l, v) is an edge labeled by l from a vertex labeled by u to one labeled by v.

The following algorithm constructs the graph of delays. Let L be a finite
language; V and E are initially empty sets.

- for x and y in L
   - if x is a prefix of y then
      - add x to V
      - add $(\varepsilon, x, x)$ to E, the edge from $\varepsilon$ to x labeled by x.

- for $v \in V$
   - for p a prefix of a word of L
      - if $vp \in L$
         - if $p \notin V$ then add p to V
         - add (v, p, p) to E, the edge from v to p labeled by p.
         - if $p \in L$ then add $(v, p, \varepsilon)$ to E, the edge from v
           to $\varepsilon$ labeled by p, and L is not a code.
   - for p in L
      - if vp is a prefix of a word of L then
         - if $vp \notin V$ then add vp to V
         - add (v, p, vp) to E, the edge from v to vp labeled by p.
      - if vp is a word of L then add $(v, p, \varepsilon)$ to E, the edge
        from v to $\varepsilon$ labeled by p, and L is not a code.
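
The construction can be sketched in Python as follows (the software described in this paper is written in Java, and this is not its code). The sketch rests on one reading of the algorithm, with three assumptions of ours: the language does not contain the empty word, "prefix" means a non-empty proper prefix whenever a new vertex is created, and the empty string stands for the delay $\varepsilon$.

```python
def graph_of_delays(language):
    """Build (V, E) for a finite language that does not contain the empty word.

    Vertices are delays (the empty string stands for epsilon);
    edges are triples (source, label, target).
    """
    L = set(language)
    # Non-empty proper prefixes of words of L: the possible unfinished words.
    proper = {w[:i] for w in L for i in range(1, len(w))}
    V, E, todo = {""}, set(), []

    def reach(v):
        if v not in V:
            V.add(v)
            todo.append(v)

    # Two factorizations start with x and y, where x is a proper prefix of y.
    for x in L & proper:
        E.add(("", x, x))
        reach(x)

    while todo:
        v = todo.pop()
        for w in L:
            # The unfinished word v is completed to w = v + p, and p begins
            # a word of the other factorization.
            if w.startswith(v) and len(w) > len(v) and w[len(v):] in proper:
                p = w[len(v):]
                E.add((v, p, p))
                reach(p)
        for p in L:
            # The other factorization reads p while v + p is still unfinished.
            if v + p in proper:
                E.add((v, p, v + p))
                reach(v + p)
            # Both factorizations end together: L is not a code.
            if v + p in L:
                E.add((v, p, ""))
    return V, E
```

For L = {a, abac, ba, ca} this yields the vertices $\varepsilon$, a, aba and c, with one edge leading back to $\varepsilon$; for a prefix code such as {ab, ba} no edge is created at all.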

Then we define the automaton $A = (Q, \varepsilon, T, \delta)$ by

- the set of states Q = V,
- the initial state $\varepsilon$,
- the set of final states T, which is $T = \{\varepsilon\}$ for finite words
  and T = V for infinite ones,
- and the transition function $\delta$, defined from E by
  $\delta(u, l) = v$ for each $(u, l, v) \in E$.

For infinite words, the recognition is done by the Büchi acceptance condition
for ω-languages.

Proposition 1 The automaton A recognizes:

- the set of words of $L^+$ with at least two factorizations, with $T = \{\varepsilon\}$,

- the set of words of $L^\omega$ with at least two factorizations, with T = V.

Proof. By the construction done by the algorithm, all words recognized by the
automaton A have two factorizations.

Let us consider such a word $\alpha$ not recognized by A. Let $v \notin V$ be
a delay of one factorization over the other and $v' \in V$ its previous delay.
There are only two cases:

- $v \in \mathrm{pref}(L)$ and $v'v \in L$, then $v \in V$.

- $v \in \mathrm{pref}(L)$, $v' \in \mathrm{pref}(L)$ and $v'^{-1}v \in L$, then $v \in V$.

Then A accepts the word $\alpha$. □

This automaton can be nondeterministic and ambiguous (i.e. one word can be
accepted by two paths in the automaton).

So the following two results hold:

Proposition 2 Let L be a language and G = (V, E) the associated graph. L is a
code iff there is no cycle in the graph G that comes back to the vertex
labeled by $\varepsilon$.

Proof. If L is not a code then there is a finite word u with two
factorizations, and the final delay between the two factorizations is
$\varepsilon$. The graph thus contains a cycle that comes back to the vertex
labeled by $\varepsilon$.

Now, if the graph has a cycle which begins at the vertex labeled by
$\varepsilon$ and comes back to this vertex, then there is a finite word with
two factorizations and hence the language L is not a code. □



Proposition 3 Let L be a language and G = (V, E) the associated graph. L is an
ω-code iff there is no cycle in G.

Proof. If there is a cycle then we can read an infinite number of edge labels,
which represent an infinite word with two factorizations, so the language L is
not an ω-code.

If L is not an ω-code then there is at least one infinite word in $L^\omega$
with two factorizations on L. We can read an infinite sequence of labels l (of
edges between two vertices). The sets V and E are finite because the language
L is finite. Then, to read an infinite sequence of labels in the graph, we
must visit some vertices more than once, and so there is a cycle in the
graph. □

Example 1. Let $L = \{a, abac, ba, ca\}$ be a language. Let us consider the
languages $L^*$ and $L^\omega$.

- The set of finite words with two factorizations is $a(baca)^+$.

- The set of infinite words with two factorizations is
  $(a(baca)^+)^* ((abac)^\omega \cup (abaca)^\omega)$.

The graph of Fig. 2 represents all those words.

In this automaton we read only words with at least two factorizations, and the
label of the vertex reached is the advance of one factorization over the
other. When we read a word with two factorizations, at the beginning the delay
of one factorization over the other is $\varepsilon$.

If on one factorization we read a, the delay is a.

So if we read ba, the factorization continues with aba and the delay is aba.

Now, reading c, the upper factorization begins with abac and the delay is c.

Then the lower factorization can continue with ca and the delay is a.

Here, if in the upper factorization we read a, we have a word of $L^+$ with
two factorizations. But we can continue by reading ba in the lower
factorization.

So we can read all the finite words with two factorizations. Those words of
$L^+$ are represented by a finite path in the graph beginning and ending at
$\varepsilon$.

An infinite word with two factorizations is represented by an infinite path
beginning at $\varepsilon$ and crossing infinitely often vertices labeled a,
aba, c, and $\varepsilon$.
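
The walk through the example can be replayed with a small Python recognizer for finite words. The edge list below is our own transcription of the edges obtained for L = {a, abac, ba, ca} by following the construction; it is an assumption of this sketch and not a copy of the drawing in Fig. 2. The empty string again stands for $\varepsilon$.

```python
# Edges (source, label, target) for L = {a, abac, ba, ca}; "" is epsilon.
EDGES = [("", "a", "a"), ("a", "ba", "aba"), ("aba", "c", "c"),
         ("c", "a", "a"), ("c", "a", "")]


def accepts(word, edges=EDGES, start="", finals=("",)):
    """Nondeterministic recognition of a non-empty finite word.

    Edge labels are words, so one step consumes a whole label at once.
    """
    reached = {(start, 0)}  # pairs (vertex, number of letters read)
    todo = [(start, 0)]
    while todo:
        vertex, pos = todo.pop()
        for source, label, target in edges:
            if source == vertex and word.startswith(label, pos):
                step = (target, pos + len(label))
                if step not in reached:
                    reached.add(step)
                    todo.append(step)
    return len(word) > 0 and any((f, len(word)) in reached for f in finals)
```

The word abaca follows the path $\varepsilon$, a, aba, c, $\varepsilon$ and is accepted, whereas baabacacaca is rejected because its two factorizations share their first factor.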




Fig. 2. The automaton for the language {a, abac, ba, ca} 



4 The Software 

The Implementation 

The software is written in Java 1.1. We implement the automaton as a graph
represented by adjacency lists. We construct the graph using the algorithm of
Section 3. During the construction of the graph, we decide whether the given
language is a code and an ω-code: if an edge comes back to the vertex labeled
by $\varepsilon$ then the language is not a code, and if an edge comes back to
a vertex previously calculated then the language is not an ω-code.
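
A hedged Python sketch of the two decisions on an adjacency-list representation is given below. Unlike the description above, it runs the tests on a finished graph rather than during construction, and it tests for cycles with a standard depth-first search, as in Proposition 3. The function names are ours.

```python
from collections import defaultdict


def adjacency(edges):
    """Adjacency lists: vertex -> list of (label, target)."""
    graph = defaultdict(list)
    for source, label, target in edges:
        graph[source].append((label, target))
    return graph


def is_code(graph):
    """No edge comes back to the vertex epsilon (the empty string)."""
    return all(target != "" for arcs in graph.values() for _, target in arcs)


def is_omega_code(graph):
    """No cycle in the graph: depth-first search with three colours."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every vertex starts WHITE

    def has_cycle(v):
        colour[v] = GREY
        for _, t in graph.get(v, []):
            if colour[t] == GREY or (colour[t] == WHITE and has_cycle(t)):
                return True
        colour[v] = BLACK
        return False

    return not any(colour[v] == WHITE and has_cycle(v) for v in list(graph))
```

For instance, the graph with edges ($\varepsilon$, a, a), (a, b, b), (b, b, b), which the construction yields for {a, ab, bb}, has no edge back to $\varepsilon$ but contains a loop, so that language is a code without being an ω-code.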



The User Interface 

The software has a graphical interface. It takes a finite language as a string
(e.g. "a,abac,ba,ca"), which is given in a text field, then draws the
automaton and checks whether the language is a code or an ω-code. There is
another text field to enter a word, also as a string (e.g. "a(baca)^2"), in
order to decide whether the word has more than one factorization on the given
language.

Abstract. Autographe is a graphical version of Automate, a system for symbolic
computation on automata and semigroups. Autographe is built upon Egg, a
graphical editor of graphs and a library of Java 1.1 classes aimed at the
graphical treatment of directed graphs, their vertices and edges, and the
associated information. Communication between the Automate server and the Egg
editor is realized via sockets.



1 Introduction 

Autographe is a system for symbolic computation on automata with graphical
editing, based upon a client/server collaboration between the following two
programs:

- Automate, a system for manipulating automata and semigroups developed by
  J.-M. Champarnaud and G. Hansel.

- Egg, a graphical editor for graphs developed by A. Uppman.



2 The Automate Program 

The Automate application lets you perform all the classical operations on regular 
expressions, automata and semigroups. It has been modified in order to play the 
role of a computation server for the Egg editor. 

3 The Egg Editor 

At first, Egg (for "Editeur graphique de graphes") was developed as a
programming exercise in C++ and Motif, and then in Java 1.0. Later on, Egg
evolved into a library of Java 1.1 classes. This library offers classes that
let you manage graphs through graphical and interactive views. Information may
be attached to the vertices and the edges. Egg has been used to build a
user-friendly interface to Automate. Communication between the two is done by
sockets.

* This work is a contribution to the Automate software development project
carried on by the A.I.A. Working Group (Algorithmics and Implementation of
Automata), L.I.F.A.R. Contact: {Champarnaud, Ziadi}@dir.univ-rouen.fr.



J.-M. Champarnaud, D. Maurel, D. Ziadi (Eds.): WIA’98, LNCS 1660, pp. 226 -^ 7 ^ 1999.



3.1 Principles

If you define a graph as a set of vertices and a set of edges joining these vertices, 
then notions like: 

— the position of vertices, 

— the graphical form of vertices and edges, 

— the information carried by edges and vertices 

are absent. 

The principles underlying the Egg library are then the following:

- the library should offer the basic classes and handle the most complex
problems arising when you try to draw vertices and edges,

- inheritance should be used to adapt these classes to any type of information
attached to vertices or edges,

- any visual representation of a graph (form, position) is arbitrary in the
sense that it is one representation among many possible ones (thus several
"views" on a single graph should be allowed).

3.2 Contents 

The Egg library primarily tries to conform to the principles stated above. As
a consequence, data structures are distributed over several classes and often
implemented with java.util.Vector or java.util.Hashtable. In short: we do not
give priority to time or space optimization.

The basic classes are Graph, Vertex and Edge. These classes are not concerned
with any screen drawing. The graphical drawing is delegated to the Point and
ArticulatedLine classes and their descendants. Among these classes,
PointVertex and LineEdge respectively point to a vertex and to an edge (notice
that the vertex and the edge are not aware of this).

Usually a PointVertex or a LineEdge is defined in an instance of the View
class. Any number of views may be defined on the same graph. The view does not
contain any information apart from form and position attributes. Any
modification of the graph or its information (other than form and position) is
thus transmitted to all of the views of the graph: the editing of the graph
may therefore be done in any view.
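
The relation between a graph and its views can be pictured with a short Python sketch (Egg itself is a Java library). Only the class names Graph and View come from the text; the methods, the attribute names and the default position are assumptions made for the illustration.

```python
class Graph:
    """Abstract graph: vertices, edges and attached information only."""

    def __init__(self):
        self.vertices, self.edges, self.views = {}, [], []

    def add_vertex(self, name, info=None):
        self.vertices[name] = info
        self._changed()

    def add_edge(self, source, target, info=None):
        self.edges.append((source, target, info))
        self._changed()

    def _changed(self):
        # Every modification of the graph is transmitted to all of its views.
        for view in self.views:
            view.refresh()


class View:
    """One drawing of a graph: holds positions, never graph information."""

    def __init__(self, graph):
        self.graph, self.positions = graph, {}
        graph.views.append(self)
        self.refresh()

    def refresh(self):
        # New vertices get a default position; existing ones keep theirs.
        for name in self.graph.vertices:
            self.positions.setdefault(name, (0, 0))

    def move(self, name, x, y):
        self.positions[name] = (x, y)  # affects this view only

    def add_vertex(self, name, x, y, info=None):
        """Editing through a view changes the shared graph."""
        self.positions[name] = (x, y)
        self.graph.add_vertex(name, info)
```

A vertex added through one view appears in every other view of the same graph, while moving it changes only the view in which it was moved.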

When a graph is given without any information on the positions of the ver- 
tices it’s always a tedious task to position them by hand. Egg offers the Placer 
class to perform this task. Various placement algorithms are ready for use. 
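
The text does not say which placement algorithms are provided. As a plain illustration of what such a placement step computes, the following Python sketch puts the vertices evenly on a circle; this particular algorithm and the function name are our assumptions.

```python
import math


def circular_placement(vertices, radius=100.0, centre=(0.0, 0.0)):
    """Return a position for each vertex, evenly spaced on a circle."""
    names = list(vertices)
    positions = {}
    for i, name in enumerate(names):
        angle = 2 * math.pi * i / len(names)
        positions[name] = (centre[0] + radius * math.cos(angle),
                           centre[1] + radius * math.sin(angle))
    return positions
```

The result is a mapping from vertices to coordinates, which is exactly the kind of form and position data that belongs to a view and not to the graph.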



4 Functionalities of Autographe 

As we have seen, the user interface of the Autographe client/server system
provides a graph editor built on the Egg library and a text editor for regular
expressions on the client side, and the Automate program on the server side.

A typical Autographe use case is as follows. The user edits a regular
expression and sends it to the server to get an automaton that recognizes the
language of the expression. A standard representation of the automaton is
presented by Autographe in a view. The user may display other non-standard
views and graphically edit one or several views. The automaton as well as the
views may be saved on disk.

Many other functionalities concerning the editing of graphs (or views) are
present. However, in this release many important functionalities are still
missing. For example, most of the functionalities of the Automate server are
not available, but will be in the next release.

We are also waiting for more user feedback to improve the ergonomics on the
client side.



5 Perspectives and Related Works 

The graphical interface to the Automate command is in progress, offering the
possibility of graphically editing an automaton. The graphical interface to
the command Monoide is to come, offering the visualization of the results of
this command. The class architecture of the Egg library will also ease the
construction of a graphical interface for the SEA software, which is made of
the Maple packages AG, MULT, and TROP.

The editor of the FLAP system allows the user to construct a finite automaton,
a pushdown automaton or a Turing machine in a window. The FSGraph editor of
the INTEX toolbox [10] allows the construction of finite-state transducers.
The Egg editor makes available the same type of functionalities for
finite-state automata. In addition, it can display an automaton from its
transition table. This possibility is present in systems which provide
animation for string matching algorithms, such as Padnon. Egg uses a state
placement algorithm which is similar to the one of the AMoRE system.

Abstract. INTEX is a linguistic development environment that allows users to
build large-coverage finite state descriptions of natural languages and apply
them to large texts in real time. INTEX represents texts, grammars and
dictionaries by Finite State Transducers.



1 Introduction 

INTEX is a linguistic development environment that allows users to build large- 
coverage finite state descriptions of natural languages and apply them to large 
texts (several dozen million words) in real time. 

One important aspect of INTEX is that grammars, dictionaries, as well as 
texts are all represented by Finite State Transducers (FSTs). Therefore, all the 
operations users perform via the graphical interface are translated into elemen- 
tary operations on FSTs. For instance, applying a set of dictionaries to a text 
is performed by constructing a union of the dictionaries’ FSTs, and then apply- 
ing the resulting FST to the text FST; removing lexical ambiguities in the text 
is performed by computing the intersection between a grammar’s FST and the 
text’s FST, etc. 

INTEX includes a hundred tools in three areas:

— tools that process texts: indexing a text, checking its format, segmenting 
it into text units (e.g. sentences), normalizing special words (e.g. elisions or 
contractions) or spelling variants, indexing sequences that match a FST, con- 
structing the resulting highlighted text or concordance, extracting text units 
that contain some (or no) matching sequences, applying statistical tools to 
study the vocabulary, the frequency, the coverage and the evolution of the 
matching sequences, displaying the FST that represents the text, etc. (these 
tools are available under the Menu Text); 

— tools that process dictionaries: checking the spelling of the entries of a dic- 
tionary, the correctness of its codes, as well as its format described by a 
FSA, sorting a dictionary, inflecting a DELAS-type dictionary, compacting 
a DELAF-type dictionary into a FST (Menu Dictionaries); 
— tools that process grammars: the graphical editor, compiling large sets of 
graphs into FSTs, formatting graphs, etc. (Menu FSGraph) 



2 Applying INTEX FSTs to Texts 

FSTs may be applied to texts in order to modify them in three modes: 



— in REPLACE mode, matching sequences are replaced with the corresponding
FST output; this feature is similar to the Search & Replace operations
of ordinary word processors;

— in MERGE mode, FST outputs corresponding to matching sequences are
inserted into the text; each output sequence is inserted before the
corresponding input sequence;

— in DISAMBIGUATING mode, FST outputs consist of sequences of lexical
constraints that must be applied to the corresponding lexical entries of the
FST input. The result is a partially tagged text.

When applying FSTs to modify a text, it is important to set some priority
rules.



2.1 FSTs Are Applied from Left to Right 

For instance, consider the text (1) and the FST of Figure 1: if we apply the FST
of Figure 1 to the text (1), we get the results (2) or (3).

z a b c d e z (1) 




Fig. 1.
in REPLACE mode: 



z X e z (2) 

in MERGE mode: 

z a b c X d e z (3)

In other words, the text was modified before the sequence b c d e was recog- 
nized. 
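
The left-to-right rule can be illustrated with a minimal Python sketch. Figure 1 is not reproduced here, so the rule set below is an assumption chosen only to be consistent with results (2) and (3): one path recognizes a b c d and outputs X before d, another recognizes b c d e and outputs Y. The names `RULES` and `apply_rules` are hypothetical and are not part of INTEX.

```python
# Hypothetical rules: (input tokens, output, offset at which MERGE inserts the output).
# Assumption: these stand in for the FST of Figure 1, which is not reproduced.
RULES = [
    (("a", "b", "c", "d"), "X", 3),
    (("b", "c", "d", "e"), "Y", 0),
]

def apply_rules(tokens, rules, mode):
    """Scan left to right; at each position apply the first rule that matches."""
    out, i = [], 0
    while i < len(tokens):
        for pattern, output, offset in rules:
            if tuple(tokens[i:i + len(pattern)]) == pattern:
                if mode == "REPLACE":
                    out.append(output)          # the match is replaced by the output
                else:                           # MERGE: keep the input, insert the output
                    matched = list(pattern)
                    out.extend(matched[:offset] + [output] + matched[offset:])
                i += len(pattern)               # resume after the consumed sequence
                break
        else:
            out.append(tokens[i])               # no rule matches here: copy the token
            i += 1
    return out

text = "z a b c d e z".split()
print(" ".join(apply_rules(text, RULES, "REPLACE")))  # z X e z
print(" ".join(apply_rules(text, RULES, "MERGE")))    # z a b c X d e z
```

Because a b c d starts one position before b c d e, it is consumed first, and the second sequence is never seen by the scanner, whatever the order of the rules.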

There is a special case for this rule: sequences that start with the special
symbol <^> (beginning of text unit). In INTEX, text units are either:

— lines or paragraphs delimited by the character NEWLINE, 

— concordance entries, i.e. 3-column lines, where the center column contains
previously indexed sequences, and the left and right columns are fixed-length
contexts of the utterances;

— any text unit delimited by the special mark {S}; usually, this mark has been 
introduced in the text with a special FST that identifies sentences. 

Sequences that start with the special symbol <^> always have priority over
sequences that do not start in this way; for instance, if we apply the FST of
Figure 2 to a text, the
sequences a b that occur at the beginning of text units will be replaced with Y,
while sequences that occur elsewhere in the text unit will be replaced with X.
In other words, at the beginning of text units, the matching sequence <^> a
b has priority over the matching sequence a b.




Fig. 2. 
Fig. 3.



2.2 Longest Matches Have Priority over Shorter Ones 

For instance, we apply the FST of Figure 3 to the text (1).

We get the same results (2) and (3) because the matching sequence a b c d
has priority over the shorter one a b c. This priority rule is natural for the
disambiguation process, where longer matching sequences are better disambiguated
than shorter ones; it also allows linguists to write default disambiguation rules,
such as in the FST of Figure 4.




Fig. 4. 



If s’ is followed by il or ils, it is a conjunction (=si); otherwise, it is a
pronoun (=se). In the first case, the matching sequence s’il has priority over the
other matching sequence s’.
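
The default-rule idiom can be sketched in Python as follows. This is only an illustration of the longest-match priority, not the INTEX implementation; the rule table and the function names are hypothetical, and the elided form is written with a plain ASCII apostrophe.

```python
# Hypothetical rules: token pattern -> lemma selected for the elided form s'.
RULES = {
    ("s'", "il"): "si",    # conjunction
    ("s'", "ils"): "si",   # conjunction
    ("s'",): "se",         # default rule: pronoun
}

def longest_match(tokens, start, rules):
    """Return (pattern, value) for the longest rule matching at `start`, or None."""
    best = None
    for pattern, value in rules.items():
        if tuple(tokens[start:start + len(pattern)]) == pattern:
            if best is None or len(pattern) > len(best[0]):
                best = (pattern, value)
    return best

def disambiguate(tokens, rules):
    """Scan left to right, recording the lemma selected by the longest match."""
    decisions, i = [], 0
    while i < len(tokens):
        hit = longest_match(tokens, i, rules)
        if hit is None:
            i += 1
        else:
            pattern, lemma = hit
            decisions.append((i, lemma))
            i += len(pattern)
    return decisions

print(disambiguate("s' il vient".split(), RULES))   # [(0, 'si')]
print(disambiguate("il s' en va".split(), RULES))   # [(1, 'se')]
```

The one-token rule acts as a default: it only fires when no longer rule matches at the same position.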



2.3 Ambiguous FSTs Produce an Undefined Result 

An ambiguous FST is a FST that associates one matching sequence with more 
than one output sequence. 






Max Silberztein 



Note that this problem only occurs when one applies an ambiguous FST to 
a text which is represented in a linear way (e.g. ASCII text). When the text is 
itself represented by a FST, applying ambiguous FSTs is perfectly valid and it 
will lead to the creation of parallel paths in the text FST. 
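
As an illustration of the definition, the following Python sketch detects ambiguity in a transducer given as an explicit, finite list of (input sequence, output sequence) pairs. That representation is an assumption made for brevity; a real FST would have to be explored path by path, and `ambiguous_inputs` is a hypothetical helper.

```python
from collections import defaultdict

def ambiguous_inputs(pairs):
    """Return the input sequences associated with more than one output sequence."""
    outputs = defaultdict(set)
    for inp, out in pairs:
        outputs[tuple(inp)].add(tuple(out))
    return {inp: outs for inp, outs in outputs.items() if len(outs) > 1}

# Hypothetical relation: "a b" is mapped to both X and Y, "c" only to Z.
pairs = [(["a", "b"], ["X"]), (["a", "b"], ["Y"]), (["c"], ["Z"])]
print(ambiguous_inputs(pairs))
```

On a linear text, the sequence a b could be rewritten in two different ways, which is why the result is undefined there.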



2.4 ε-FSTs Cannot Be Applied to Texts

An ε-FST is a FST that recognizes the empty string.
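
Whether a FST recognizes the empty string can be tested before applying it. The sketch below assumes a hypothetical encoding in which transitions are (source, input symbol, target) triples and an empty input symbol is written as the empty Python string; it is not the INTEX data structure.

```python
def accepts_empty(initial, finals, transitions):
    """True if a final state is reachable from `initial` using only empty-input moves."""
    seen, stack = {initial}, [initial]
    while stack:
        state = stack.pop()
        if state in finals:
            return True
        for src, symbol, dst in transitions:
            if src == state and symbol == "" and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return False

# Hypothetical automaton for an optional "a": state 0 can reach final state 1 on nothing.
print(accepts_empty(0, {1}, [(0, "a", 1), (0, "", 1)]))  # True
print(accepts_empty(0, {1}, [(0, "a", 1)]))              # False
```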

3 A Walkthrough 

Now I will comment on several screen shots that correspond to important stages 
of text parsing: 



1. Processing an ASCII-ISO ’raw’ text in order to prepare it for further linguistic
analysis. Generally, this preprocessing stage consists of three applications
of FSTs to the text:

— a FST used in MERGE mode to insert text unit delimiters (usually 
sentence delimiters), 

— a lexical FST used in REPLACE mode to tag unambiguous compound 
words, 

— a FST used in REPLACE mode to resolve elisions and contractions (such
as can’t for can not).

2. Identifying words in the text consists of applying selected dictionaries and 
morphological grammars to the text. Linguistic words are either simple words 
(lexical entries that are sequences of letters) or compound words (lexical en- 
tries that contain more than one simple word). Dictionaries are in the form 
of DELAFs, that is, their entries are inflected forms (e.g. eaten), and are 
associated with both a corresponding lemma (e.g. eat) and some kind of 
information (e.g. verb, participle). Morphological grammars are graphs that 
identify families of (potentially infinite) form variants and produce both a 
corresponding canonical form (a lemma, or a standard keyword for ortho- 
graphic variants or synonymous expressions) and some kind of linguistic 
information. 

3. Generally, consultation of the dictionaries produces more than one lexical 
entry for each form in the text. The next step is to remove these ambigui- 
ties by applying local grammars in the form of FSTs that associate certain 
matching ambiguous sequences of forms with some lexical constraints. These 
constraints are in turn applied to the forms during consultation of the dic- 
tionaries in order to destroy irrelevant lexical hypotheses. 
4. Applying a FST to a text to locate matching sequences. The resulting index
can be used to highlight the text (all matching sequences are colored
and underlined), to extract all the text units that contain (or do not con- 
tain any) matching sequences, or to display all the matching sequences in 
context (concordance). Matching sequences can also be studied with several
statistical tools.

5. Constructing the FSTs that represent each sentence of the text; in this FST, 
inputs are lemmas, and outputs are the corresponding lexical information.
Ambiguities between simple words, or between a compound word and the 
corresponding sequence of simple words are represented by parallel paths. 
This FST is meant to be the input of the INTEX syntactic parser. 



3.1 Preprocessing the Text 



An ASCII text is loaded; the FST Sentence is applied to the text in MERGE
mode to insert the text unit delimiter {S} between sentences (Figure 5).




Fig. 5. 
In this notation, states are hidden (they can be imagined as being at the left 
side of every box). Inputs are written inside boxes; outputs are written below 
boxes. <E> stands for the empty string, <MAJ> stands for any word in uppercase
letters, <NB> stands for any sequence of digits. Gray nodes are names
of embedded FSTs; for instance, Upper case char is the name of an embedded
FST that identifies the 26 letters A...Z.

The path at the top of the graph matches sequences of a period, a question
mark or an exclamation mark followed by a word in uppercase; the
symbol {S} is then inserted between the punctuation mark and the word. Below is a path
that matches sequences such as Prof. C. Doe. For such sequences, no sentence 
delimiter is inserted. 
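
The behavior of these two paths can be approximated with a regular expression. The following Python sketch is only a rough analogue of the Sentence FST: it assumes that a capitalized word is a sufficient cue for a sentence start, and the abbreviation list is hypothetical.

```python
import re

# Hypothetical list of titles after which a period does not end a sentence.
ABBREVIATIONS = {"Prof", "Dr", "Mr", "Mrs"}

# A word, a sentence-final mark, white space, then (lookahead) an uppercase letter.
BOUNDARY = re.compile(r"(\w+)([.?!])(\s+)(?=[A-Z])")

def insert_delimiters(text):
    """Insert {S} after sentence-final punctuation, as in MERGE mode."""
    def repl(match):
        word, mark, space = match.groups()
        is_initial = len(word) == 1 and word.isupper()
        if mark == "." and (word in ABBREVIATIONS or is_initial):
            return match.group(0)          # e.g. "Prof. C. Doe": left unchanged
        return f"{word}{mark}{{S}}{space}"
    return BOUNDARY.sub(repl, text)

print(insert_delimiters("Prof. C. Doe left. He was late! Why? Nobody knows."))
# Prof. C. Doe left.{S} He was late!{S} Why?{S} Nobody knows.
```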

FSTs built in INTEX are fully editable and customizable; there is a different 
FST used to identify sentences for each language. 



3.2 Simple and Compound Words 

Lexical FSTs and dictionaries are applied to identify simple and compound
words (Figure 6).

After the preprocessing stage, users select dictionaries and FSTs used for the 
recognition of simple and compound words in the text. Dictionaries are in the 
form of a DELAF dictionary, that is, they contain entries as inflected forms (e.g. 
ate) and associate them with a lemma (eat), as well as some kind of linguistic 
information such as a syntactic category (V) and some inflectional information
(Preterit). 

FSTs must be used when lexical entries are infinite in number (e.g. numerical
determiners such as two hundred and fifty six); they are also convenient for
grouping families of orthographic variants, morphological derivations, or
synonymous expressions.

Tokens of the text are either sequences of letters (words), sequences of digits 
(numbers), tags (linguistic information written between curly brackets) or de- 
limiters (all the characters that are neither letters nor digits). Tokens are sorted 
by decreasing frequency. 
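
The four token classes can be sketched with one regular expression in Python. This is an illustration under stated assumptions, not the INTEX tokenizer: white space is skipped rather than counted as a delimiter, and the names `TOKEN`, `tokenize` and `by_decreasing_frequency` are hypothetical.

```python
import re
from collections import Counter

TOKEN = re.compile(
    r"(?P<tag>\{[^{}]*\})"      # linguistic information between curly brackets
    r"|(?P<word>[^\W\d_]+)"     # sequence of letters
    r"|(?P<number>\d+)"         # sequence of digits
    r"|(?P<delimiter>\S)"       # any other single non-blank character
)

def tokenize(text):
    """Return (class, token) pairs in text order."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

def by_decreasing_frequency(tokens):
    """Return ((class, token), count) pairs, most frequent first."""
    return Counter(tokens).most_common()

tokens = tokenize("{S} the cat saw the 2 dogs; the end.")
print(tokens[:3])                          # [('tag', '{S}'), ('word', 'the'), ('word', 'cat')]
print(by_decreasing_frequency(tokens)[0])  # (('word', 'the'), 3)
```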

Users select dictionaries and FSTs to be applied to the text. Users can edit 
dictionaries and FSTs and add their own data. Dictionaries and FSTs are asso- 
ciated with a 3-level priority system that allows hiding or imposing some kind 
of lexical information. 

The result of the consultation of all selected dictionaries is presented below,
in four windows (Figure 7):
Fig. 6. 



— all the simple words that correspond to one or more lexical entries are presented
in the top left window, together with the corresponding lemma
and linguistic information. Ambiguous forms are presented on several lines;

— all the simple words that have neither been found in any selected dictionary
nor matched any lexical FST are displayed in the top right window. Usually,
these forms are either spelling errors, proper names, or words specific to the
domain;

— the compound words are displayed in the lower two windows. Certain forms 
are a priori ambiguous, such as red tape (either the adjective followed by the 
noun, or the compound noun). Other compounds are not ambiguous, such 
as best-seller or acquired immune deficiency syndrome.

Users can edit these resulting dictionaries in order to remove standard lexical 
ambiguities that by chance do not occur in the specific text, or to move words 
from the unknown words window to the simple words window, without modify- 
ing the general dictionaries of the system. 
Fig. 7. 



3.3 Disambiguation of Grammatical Words 

When one applies local grammars to disambiguate grammatical words in a
French text, each local grammar is an FST that recognizes certain text
sequences (e.g. Figure 8: il le la donne) and then applies the corresponding lexical
constraints (e.g. <PRO> <PRO> <PRO> <V:3s>). All the lexical hypotheses
that do not match the lexical constraints are then deleted.

The result displayed above is the tagged text, i.e. the full text in which all 
the disambiguated forms have been replaced by the corresponding tag, which 
appears between curly brackets. For instance, the original text starts with (4).

C’est en Egypte, vers la fin de... (4) 

After the disambiguation process, the text becomes (5):

{C’,ce.PRO+PPV:ms} {est,etre.V:P3s} en Egypte,

vers {la,le.DET:fs} fin {de,.PREP}... (5)

{C’,ce.PRO+PPV:ms} stands for the form C’, corresponding to the lemma ce, pronoun,
preverbal particle, masculine singular; {est,etre.V:P3s} stands for the form
est, corresponding to the lemma etre, verb, present, 3rd person singular, etc.
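
The tag notation can be decoded mechanically. The Python sketch below infers the syntax from the examples in (5) only: form, comma, lemma, period, category, optional features introduced by +, optional inflection introduced by a colon. Treating an empty lemma as "same as the form" is an assumption, and `Tag` and `parse_tag` are hypothetical names.

```python
import re
from typing import NamedTuple, Tuple

class Tag(NamedTuple):
    form: str
    lemma: str
    category: str
    features: Tuple[str, ...]
    inflection: str

TAG = re.compile(
    r"\{([^,{}]+),([^.{}]*)\.([^:+{}]+)((?:\+[^:+{}]+)*)(?::([^{}]+))?\}"
)

def parse_tag(text):
    """Split a tag such as {est,etre.V:P3s} into its components."""
    m = TAG.fullmatch(text)
    if m is None:
        raise ValueError(f"not a tag: {text!r}")
    form, lemma, category, features, inflection = m.groups()
    return Tag(
        form=form,
        lemma=lemma or form,  # assumption: an empty lemma means "same as the form"
        category=category,
        features=tuple(features.split("+")[1:]) if features else (),
        inflection=inflection or "",
    )

print(parse_tag("{C',ce.PRO+PPV:ms}"))
print(parse_tag("{de,.PREP}"))
```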
Fig. 8. 



In a tagged text, all the forms that are ambiguous are left unmodified. It is 
also possible to display the remaining ambiguities by displaying either a regular 
expression or the finite state transducer of the text. 



3.4 Indexing an FST 

— It is possible to index sequences that match a regular expression, such as:
<be:P> (<ADV> + <E>) going to <V:W> + (will + shall) <V:W>

This expression matches all the forms of the word "be" conjugated in the present
tense, followed by an optional adverb, followed by "going to", followed by any
verb in the infinitive, or the word "will" or "shall" followed by a verb in the
infinitive. Any user-defined code that appears in a dictionary can automatically
be used in a symbol (between angle brackets);

— one can also locate all the entries of a compound word dictionary. In this 
way, technical terms can be indexed and extracted; 
— one can also use the graphical editor to construct grammars; these graphs
can be applied to text (Figure 9).
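
To make the semantics of the regular expression in the first item concrete, here is a minimal Python sketch that encodes it by hand and indexes its matches in a tagged sentence. The token fields, the helper names and the sample tagging are assumptions of this sketch; only the codes P (present) and W (infinitive) come from the expression itself.

```python
def tok(form, lemma=None, cat="", infl=""):
    """Build a tagged token; the field names are assumptions of this sketch."""
    return {"form": form, "lemma": lemma or form, "cat": cat, "infl": infl}

def sym(lemma=None, cat=None, infl=None):
    """Symbol between angle brackets, e.g. <be:P>, <ADV>, <V:W>."""
    def test(t):
        return ((lemma is None or t["lemma"] == lemma)
                and (cat is None or t["cat"] == cat)
                and (infl is None or infl in t["infl"]))
    return test

def word(*forms):
    """Literal word, or a choice between literal words."""
    return lambda t: t["form"] in forms

INFINITIVE = sym(cat="V", infl="W")
# Each alternative is a list of (predicate, optional) steps.
PATTERN = [
    [(sym(lemma="be", infl="P"), False), (sym(cat="ADV"), True),
     (word("going"), False), (word("to"), False), (INFINITIVE, False)],
    [(word("will", "shall"), False), (INFINITIVE, False)],
]

def match_at(tokens, start, steps):
    """Return the end index of the longest match of `steps` at `start`, or None."""
    if not steps:
        return start
    (pred, optional), rest = steps[0], steps[1:]
    best = None
    if start < len(tokens) and pred(tokens[start]):
        best = match_at(tokens, start + 1, rest)
    if optional:                                   # the <E> branch: skip this step
        skipped = match_at(tokens, start, rest)
        if skipped is not None and (best is None or skipped > best):
            best = skipped
    return best

def index_matches(tokens, pattern):
    """Scan left to right and return (start, end) spans of the longest matches."""
    hits, i = [], 0
    while i < len(tokens):
        ends = [e for e in (match_at(tokens, i, alt) for alt in pattern) if e is not None]
        if ends:
            hits.append((i, max(ends)))
            i = max(ends)
        else:
            i += 1
    return hits

sentence = [
    tok("she", cat="PRO"), tok("is", "be", "V", "P3s"), tok("probably", cat="ADV"),
    tok("going", "go", "V", "G"), tok("to"), tok("leave", cat="V", infl="W"),
    tok("and"), tok("we", cat="PRO"), tok("will", cat="V"), tok("stay", cat="V", infl="W"),
]
print(index_matches(sentence, PATTERN))  # [(1, 6), (8, 10)]
```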




Fig. 9. 



Above, we have applied a FST that recognizes certain complements of dates 
(over 8000 states in over 100 graphs) to the text. Matching sequences are high- 
lighted in the top left window; the corresponding concordance is displayed in the 
lower right-hand window. 

We can extract all the sentences that contain one or more matches (or no 
match), from the text; several statistical functions can be applied to study the 
density of matches in the text, the evolution of matching sequences throughout 
the text, the vocabulary of matching sequences (i.e. the set of all different match- 
ing sequences that match the FST), etc. 



3.5 Text Representation 

Internally, each text sentence is represented by a FST. For instance, the text
(6) is represented in Figure 10 with all the ambiguities that remain after having
applied INTEX default local grammars.
Il donne la pomme de terre cuite (6) 

Thanks to several local grammars, only the 3rd person singular of the verb 
donner is displayed; the determiner la can be followed by the noun or adjective 
pomme, or the noun pomme de terre, while the pronoun la can only be followed 
by the verb pommer; the form de has also been disambiguated. 

Note that at this stage of analysis, the computer cannot decide if the correct 
interpretation is: 

— He gives the clay apple (compound noun terre cuite = clay)

— He gives the cooked potato (compound noun pomme de terre = potato)
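
The parallel paths of such a text FST can be sketched as a small acyclic graph in Python. The figure is not reproduced here, so the states and edges below are an assumption reconstructed from the description of sentence (6) above; inputs are lemmas and outputs are lexical information, and the category codes are simplified.

```python
# Hypothetical text FST for sentence (6): edges are (source, target, lemma, information).
EDGES = [
    (0, 1, "il", "PRO"),
    (1, 2, "donner", "V:P3s"),
    (2, 3, "le", "DET"),
    (2, 4, "le", "PRO"),
    (3, 5, "pomme", "N"),
    (3, 5, "pomme", "A"),
    (3, 7, "pomme de terre", "N"),   # compound word: one edge spanning three forms
    (4, 5, "pommer", "V"),           # the pronoun can only be followed by the verb
    (5, 6, "de", "PREP"),
    (6, 7, "terre", "N"),
    (6, 8, "terre cuite", "N"),      # compound word
    (7, 8, "cuit", "A"),
]
FINAL = 8

def readings(state=0, prefix=()):
    """Enumerate every path from `state` to the final state."""
    if state == FINAL:
        yield prefix
    for src, dst, lemma, info in EDGES:
        if src == state:
            yield from readings(dst, prefix + (f"{lemma}.{info}",))

all_readings = list(readings())
print(len(all_readings))  # 7
for path in all_readings:
    print(" ".join(path))
```

Both intended interpretations appear among the enumerated paths; choosing between them is left to later stages of analysis.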



4 INTEX as a Research Tool 

Over 30 research centers are presently using INTEX as a research tool in various
domains: computational linguistics, corpus-based linguistics, information
retrieval, terminology, literature studies, and second-language teaching.
Large-coverage INTEX descriptions have already been built for Bulgarian 
(Prof. Elena Paskaleva, University of Sofia), English, French and Spanish (Prof. 
Maurice Gross, University of Paris 7), German (Prof. Franz Guenthner, University
Maximilian, Munich), Greek (Prof. Ana Symeonides, University of Thessaloniki),
Italian (Prof. Annibale Elia, University of Salerno), Polish (Prof. Zygmunt
Vetulani, University of Gdansk), Portuguese (Prof. Elisabete Ranchhod,
University of Lisbon), and Russian (Alex Kolesov, Academy of Sciences, Moscow).
INTEX dictionaries are being constructed for Korean (Jeesun Nam, University
of Seoul) and Old French (Prof. Hava Bat-Zeev, University of Tel Aviv).

See [Silberztein 1996a] for the proceedings of the first INTEX users’ workshop.
Sixty participants came from 10 countries; 15 papers demonstrated the various
applications of the system.

Two European Projects have used the INTEX environment: 

— to build a computer-aided translation workstation (EUROLANG used INTEX
technology and linguistic data to automatically identify terms in texts);

— to build large-coverage descriptions of Bulgarian, German, Polish and Rus- 
sian (BILEDITA project). 

5 INTEX as an Education Tool 

Describing the linguistic data included in INTEX (dictionaries for simple words, 
description of the morphology, dictionaries for compounds, dictionaries for frozen 
expressions, local grammars used to tag and disambiguate texts) as well as the 
technology used to handle this data (Finite State Automata and Transducers) 
corresponds to a full-year course for graduate students. There is a plethora of
projects that remain to be proposed to students, as the description of natural
languages is nowhere near complete! 

INTEX is also used to teach French as a second language; it allows teachers
to ask students to locate morpho-syntactic patterns in large texts (such as
5 years of the newspaper Le Monde), to build local grammars for semi-frozen
expressions (e.g. how to express a date in French) and to ’play’ (edit, correct,
apply and test) with some linguistic descriptions.